BERT and RoBERTa in R: Transformer-Based NLP

Author

Martin Schweinberger

Introduction

This tutorial introduces BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimised BERT Pretraining Approach) and demonstrates how to apply them to a range of NLP tasks in R. Both models produce rich, context-sensitive representations of language and can be adapted to many downstream tasks — including sentiment analysis, named entity recognition, question answering, and custom classification — with relatively little task-specific data.

The tutorial covers the conceptual architecture of both models, a guide to the two main R interfaces (text and reticulate), a candid discussion of R’s dependence on Python for transformer inference, and hands-on workflows for five core tasks. A model comparison section at the end helps you choose the right model and tool for your research.

Prerequisite Tutorials

Before working through this tutorial, you should be comfortable with:

Learning Objectives

By the end of this tutorial you will be able to:

  1. Explain how BERT and RoBERTa work and how they differ
  2. Understand why R transformer packages require Python and how to set up a working environment
  3. Choose between the text and reticulate interfaces for different tasks
  4. Extract contextualised sentence embeddings using text with BERT
  5. Perform text classification using reticulate with RoBERTa
  6. Run named entity recognition using a HuggingFace NER pipeline
  7. Apply extractive question answering using text
  8. Fine-tune RoBERTa on a custom classification dataset
  9. Compare DistilBERT, BERT-base, and RoBERTa-base across tasks

Citation

Schweinberger, Martin. 2026. BERT and RoBERTa in R: Transformer-Based NLP. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/bert/bert.html (Version 2026.05.01).

Code in This Tutorial Is Not Executed

Unlike other LADAL tutorials, the code chunks in this tutorial are displayed but not run during knitting (eval=FALSE throughout). There are two reasons for this.

First, running BERT and RoBERTa requires loading large neural network models (260–480 MB each) and executing computationally intensive operations that can take minutes to hours depending on available hardware — making automatic execution during knitting impractical.

Second, all transformer computation in R depends on Python running in the background via reticulate. The Python environment must be configured with Sys.setenv(RETICULATE_PYTHON = ...) before any R package is loaded in a session. This constraint means that no Python-dependent code can be safely executed in a knitted document without a pre-configured environment on the build server.

To run the code yourself, work through the tutorial interactively in RStudio: complete the setup steps in Section 4 first, then execute chunks manually in order. All code is fully functional when run this way in a session where the r-transformers virtualenv has been configured.

How BERT and RoBERTa Work

Section Overview

What you will learn: The limits of earlier NLP approaches and why transformer models supersede them; how BERT’s bidirectional self-attention and WordPiece tokenisation work; what the [CLS] and [SEP] tokens do; BERT’s pre-training objectives; how RoBERTa improves on BERT through training optimisation; and the transfer learning paradigm shared by both models

The Problem These Models Solve

Earlier NLP approaches treated words as discrete symbols or as fixed vectors in a static embedding space (word2vec, GloVe). These representations have two fundamental limitations:

Context-insensitivity — a static embedding assigns the same vector to a word regardless of how it is used. The word bank has identical representations in “river bank” and “bank account” despite referring to entirely different things. Resolving polysemy requires context, which static embeddings cannot capture.

Unidirectionality — earlier sequential models (RNNs, LSTMs) processed text left to right or right to left but never both simultaneously. Each word’s representation was conditioned only on what came before (or after), missing the full mutual dependencies of sentential context.

BERT and RoBERTa address both problems through a transformer architecture trained bidirectionally on massive amounts of text.

The Transformer and Self-Attention

The transformer (Vaswani et al. 2017) abandons sequential processing entirely. Instead of reading text word by word, it processes all tokens simultaneously through self-attention — a mechanism that allows each token to attend to every other token and produce a representation integrating information from the full context.

For each token, self-attention computes an updated representation as a weighted sum of all other tokens’ representations. The weights — the attention scores — are learned during training. The result is that each token’s representation is enriched with bidirectional contextual information.
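The weighted-sum idea can be sketched in a few lines of R. This is a toy single-head illustration only: real BERT attention uses learned query, key, and value projections and twelve heads per layer, whereas here tokens are scored by raw dot products just to show the mechanics.

```r
# Toy self-attention over 3 token vectors (rows of X).
softmax <- function(x) exp(x) / sum(exp(x))

X <- rbind(c(1, 0), c(0, 1), c(1, 1))   # 3 tokens, 2-dim embeddings

scores  <- X %*% t(X)                   # token-by-token similarity scores
weights <- t(apply(scores, 1, softmax)) # each row is an attention distribution summing to 1
updated <- weights %*% X                # each token becomes a weighted sum of all tokens

rowSums(weights)                        # 1 1 1
```

Each row of `updated` now mixes in information from every other token, which is exactly the sense in which the representation is "contextualised".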

BERT stacks 12 transformer layers (BERT-base) or 24 (BERT-large), each refining the representations from the previous layer. The final output is a set of contextualised embeddings — one vector per input token — that capture each token’s meaning in the specific context of the full input.

WordPiece Tokenisation

BERT uses WordPiece tokenisation, a subword algorithm that splits words into pieces found frequently in the training vocabulary:

"playing"       → ["playing"]
"unplayable"    → ["un", "##play", "##able"]
"Schweinberger" → ["S", "##chw", "##ein", "##berg", "##er"]

The ## prefix marks a continuation subword. WordPiece ensures BERT never encounters a truly out-of-vocabulary word (any word can be decomposed to characters as a last resort) and gives the model implicit morphological sensitivity.
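The ## convention is easy to demonstrate by rejoining subword pieces in R. The helper below is a toy detokeniser written for this tutorial, not part of any package:

```r
# Rejoin WordPiece pieces into words: a piece starting with "##"
# continues the previous word; anything else starts a new word.
join_wordpieces <- function(pieces) {
  words <- character(0)
  for (p in pieces) {
    if (startsWith(p, "##") && length(words) > 0) {
      words[length(words)] <- paste0(words[length(words)], sub("^##", "", p))
    } else {
      words <- c(words, p)
    }
  }
  words
}

join_wordpieces(c("un", "##play", "##able"))   # "unplayable"
```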

Tokens ≠ Words

In BERT, a “token” is a WordPiece subword unit, not a word. A single word may map to multiple tokens and a single token may be a partial word. The number of BERT tokens in a sentence typically exceeds the word count, and embeddings are produced per token. To obtain word- or sentence-level representations you must aggregate across tokens — typically by averaging.
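The aggregation step can be sketched with a mock token-embedding matrix (random numbers standing in for real BERT output; the token indices are hypothetical):

```r
# Mock BERT output: one 768-dim vector per WordPiece token.
set.seed(1)
n_tokens  <- 7   # e.g. [CLS] un ##play ##able game ##s [SEP]
token_emb <- matrix(rnorm(n_tokens * 768), nrow = n_tokens, ncol = 768)

# Sentence-level representation: average across all token rows
sentence_vec <- colMeans(token_emb)
length(sentence_vec)   # 768

# Word-level representation for a word split across tokens 2-4
word_vec <- colMeans(token_emb[2:4, , drop = FALSE])
length(word_vec)       # 768
```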

Special Tokens: [CLS] and [SEP]

BERT prepends [CLS] (classification token) to every input and appends [SEP] (separator token) at the end. For two-sequence inputs, a second [SEP] marks the boundary between them:

[CLS] The cat sat on the mat [SEP]
[CLS] What did the cat do? [SEP] The cat sat on the mat [SEP]

By the final transformer layer, the [CLS] embedding has attended to every other token and serves as a fixed-length summary of the entire input. For classification tasks this vector is fed into a small linear head to produce class probabilities.

BERT’s Pre-Training

BERT is pre-trained on two self-supervised tasks requiring no human annotation:

Masked Language Modelling (MLM) — 15% of tokens are replaced with [MASK] and the model learns to predict the originals from context. This forces the model to build deep, bidirectional contextual representations.

Next Sentence Prediction (NSP) — the model learns to predict whether two sentences are genuine consecutive sentences or random pairings, teaching cross-sentence coherence.

BERT is pre-trained on both objectives jointly, using 3.3 billion words from BooksCorpus and English Wikipedia. The resulting weights are publicly released for fine-tuning.
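The masking step of MLM can be illustrated with a toy R sketch. Note that BERT's actual recipe is slightly more involved (of the selected tokens, 10% are left unchanged and 10% are swapped for random tokens); this shows only the core idea:

```r
# Toy MLM masking: replace ~15% of tokens with [MASK].
set.seed(42)
tokens <- c("the", "cat", "sat", "on", "the", "mat", "in", "the", "sun")

n_mask <- max(1, round(0.15 * length(tokens)))  # at least one token
idx    <- sample(seq_along(tokens), n_mask)

masked <- tokens
masked[idx] <- "[MASK]"
masked
```

During pre-training the model sees only `masked` and must recover the original tokens at the masked positions from the surrounding context.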

How RoBERTa Differs from BERT

RoBERTa (Liu et al. 2019) uses exactly the same transformer architecture as BERT. Its improvements are entirely in training methodology:

More data and longer training — trained on 160 GB of text (vs BERT’s ~16 GB) for significantly more steps.

Dynamic masking — BERT fixes masking patterns at preprocessing time. RoBERTa generates a fresh masking pattern each time a training example is seen, which consistently improves performance.

No Next Sentence Prediction — NSP was found to be unhelpful or actively harmful and was removed. RoBERTa trains only on the MLM objective using longer input sequences.

Larger batch sizes — RoBERTa uses much larger mini-batches with an adjusted learning rate schedule, producing more stable and better-performing models.

Byte-Pair Encoding (BPE) — RoBERTa uses byte-level BPE rather than WordPiece, making it slightly more robust to unusual formatting and non-standard text.

The resulting model consistently outperforms BERT on standard NLP benchmarks (GLUE, SQuAD) at identical inference-time cost.

The Transfer Learning Paradigm

Both models share the same transfer learning approach:

Pre-training  →  pre-trained weights (public, downloadable)
                      │
     ┌────────────────┼────────────────┬──────────────────┐
     ▼                ▼                ▼                  ▼
+ classif. head  + token head    + span head       + regression head
sentiment / topic    NER           question ans.    semantic similarity

Fine-tuning adds a small task-specific layer and trains on labelled data, starting from the pre-trained weights. This requires far less labelled data than training from scratch.

Exercises: How BERT and RoBERTa Work

Q1. A researcher feeds the sentence “I went to the bank to deposit my cheque” into BERT, then feeds “The river bank was eroded by the flood.” Why will the BERT embedding for bank differ between the two sentences?






Q2. RoBERTa and BERT-base have identical transformer architectures at inference time. What are the two most important training differences that explain RoBERTa’s consistently better benchmark performance?






Can This Be Done in R Only?

Section Overview

What you will learn: Why pure-R transformer inference is not currently possible; what every R transformer package is actually doing under the hood; and why the reticulate + virtualenv approach used in this tutorial is the recommended setup

The Short Answer

No — not meaningfully. Running BERT or RoBERTa requires executing large neural network computations that depend on PyTorch or TensorFlow. There is no pure-R implementation of transformer inference. Every R package that appears to run BERT — including text, the now-unmaintained RBERT, and aifeducation — is calling Python under the hood via reticulate.

What it looks like:       textEmbed("Hello, BERT.")   ← pure R call
What is actually running: R → reticulate → Python → PyTorch → GPU/CPU

The text package (Kjell, Giorgi, and Schwartz 2023) provides the most R-like experience: you call R functions, the package manages all Python communication transparently, and you never write Python code yourself. But Python is always running in the background.

Why Not a Pure-R Implementation?

A single BERT-base forward pass pushes its input through roughly 110 million parameters across 12 transformer layers, amounting to billions of floating-point operations. PyTorch and TensorFlow are highly optimised C++/CUDA libraries specifically designed for this kind of computation. Rewriting it in pure R would produce code that is orders of magnitude slower and would offer no practical benefit — the Python layer is invisible to the user in normal use.

Practical Implications

This has three practical consequences for tutorial users:

Python must be installed. You need Python 3.8+ on your machine. The reticulate package provides R functions to install and manage Python environments so you never need to open a terminal.

Environment setup is a one-time cost. Once the virtualenv is created and packages are installed, subsequent sessions start with a single Sys.setenv() call and are otherwise indistinguishable from working in pure R.

Network access is needed once per model. Model weights are downloaded from the HuggingFace Hub on first use and cached locally. After that, everything works offline.

RBERT Is Not Recommended

You may encounter the RBERT package (GitHub: jonathanbratt/RBERT) in older tutorials. This package requires TensorFlow ≤ 1.13.1 — a version from 2019 that is incompatible with current Python, hardware, and R environments. It is unmaintained and should not be used for new projects. This tutorial uses text and reticulate with current HuggingFace Transformers instead.


R Interfaces for Transformer Models

Section Overview

What you will learn: The two recommended R interfaces — text and reticulate — their architectures, strengths, and which tasks each is best suited to

The Two Interfaces

Both interfaces call the same underlying HuggingFace Transformers Python library. They differ in how much of the Python API they expose and how much R-native convenience they add:

Your R code
    │
    ├── text::textEmbed()     ← high-level R functions; Python handled transparently
    └── reticulate::import()  ← direct Python API access in R syntax
         │
         ▼
    Python: transformers, torch
         │
         ▼
    HuggingFace model weights (cached locally after first download)

The text Package

The text package (Kjell, Giorgi, and Schwartz 2023) is designed for researchers who want transformer embeddings in R without writing Python. Its primary strengths are:

  • Clean R-native functions with no Python knowledge required
  • Built-in support for embedding, UMAP/PCA projection, and semantic similarity
  • Tight integration with tidymodels for downstream statistical modelling
  • Output is R-native matrices and tibbles throughout

Its limitation is that it exposes a narrower slice of the HuggingFace API. Fine-tuning and advanced pipeline configuration require reticulate.

The reticulate Interface

Using reticulate to call HuggingFace directly gives access to the complete Python API with no restrictions. This is the appropriate choice for:

  • Fine-tuning models with the Trainer API
  • Zero-shot classification with BART/RoBERTa-MNLI
  • NER and QA pipeline configuration
  • Any task not yet wrapped by text

The trade-off is verbosity — reticulate code looks like Python transliterated into R, which is unfamiliar at first but becomes natural quickly.

Interface Comparison

`text`

  • Best for: Embeddings, semantic similarity, QA, clustering, downstream modelling in R
  • R feel: High — pure R functions, no Python syntax
  • Flexibility: Moderate
  • Output: R matrices / tibbles directly

`reticulate`

  • Best for: Fine-tuning, NER, zero-shot classification, advanced pipeline configuration
  • R feel: Low — Python API in R syntax
  • Flexibility: Full HuggingFace API
  • Output: Python objects — need conversion to R

Which Interface Does This Tutorial Use Where?

Task Interface Model
Sentence embeddings text BERT-base / sentence-transformers
Text classification reticulate RoBERTa-base
Named entity recognition reticulate BERT-NER
Question answering text DistilBERT (SQuAD)
Fine-tuning reticulate RoBERTa-base
Model comparison both DistilBERT / BERT-base / RoBERTa-base

Setup

Section Overview

What you will learn: How to create a Python virtualenv with reticulate; how to lock R to the correct Python before any library loads; and how to verify the environment is working

Step 1 — Install R Packages (run once)

Code
install.packages(c(
  "text",        # high-level embedding and QA interface
  "reticulate",  # direct Python interface
  "dplyr", "ggplot2", "flextable",
  "tibble", "stringr", "tidyr", "purrr",
  "checkdown"
))

Step 2 — Create a Python Virtualenv and Install Python Packages (run once)

Network Note

textrpp_install() attempts to download Miniforge from GitHub, which is blocked on many university servers. The virtualenv approach below avoids this entirely — it uses pip and only requires access to PyPI, which is almost always permitted.

Code
library(reticulate)

# Create a self-contained virtualenv (no Conda required)
reticulate::virtualenv_create("r-transformers")

# Install all required Python packages into it
reticulate::virtualenv_install(
  envname  = "r-transformers",
  packages = c(
    "transformers",
    "torch",
    "sentence-transformers",
    "datasets",
    "accelerate"
  ),
  ignore_installed = FALSE
)

This downloads ~1.5 GB of Python packages and takes 5–15 minutes depending on connection speed. It only needs to be done once.

Step 3 — Lock R to the Virtualenv (every session / top of every script)

This Must Come First

reticulate binds to a Python interpreter the first time any package or function triggers Python — and that binding cannot be changed within a session. Sys.setenv(RETICULATE_PYTHON = ...) must appear before every library() call in your script and as the very first line of your Quarto setup chunk.

Code
# ── Run this FIRST, before any library() call ──────────────────────────────

# Construct the path to the virtualenv Python executable
# Windows:
python_path <- file.path(
  Sys.getenv("USERPROFILE"), "Documents",
  ".virtualenvs", "r-transformers", "Scripts", "python.exe"
)
# Mac / Linux — uncomment and use this line instead:
# python_path <- path.expand("~/.virtualenvs/r-transformers/bin/python")

Sys.setenv(RETICULATE_PYTHON = python_path)

Finding Your Virtualenv Path

If you are unsure of the exact path, run this after creating the virtualenv:

reticulate::virtualenv_python("r-transformers")

This prints the full path to the Python executable — copy it into Sys.setenv().

Step 4 — Load R Packages

Code
library(reticulate)
library(text)
library(dplyr)
library(ggplot2)
library(flextable)
library(tibble)
library(stringr)
library(tidyr)
library(purrr)
library(checkdown)

Step 5 — Verify

Code
# Define here so it is available for the verification test below.
# The full explanation of this function is in Section 5 (Feature Extraction).
extract_emb_matrix <- function(emb) {
  is_numeric_table <- function(x) {
    if (is.matrix(x) && is.numeric(x) && nrow(x) > 0) return(TRUE)
    if (is.data.frame(x)) {
      num_cols <- x[, sapply(x, is.numeric), drop = FALSE]
      return(ncol(num_cols) > 0)
    }
    FALSE
  }
  as_num_matrix <- function(x) {
    if (is.matrix(x)) return(x)
    x <- x[, sapply(x, is.numeric), drop = FALSE]
    as.matrix(x)
  }
  find_matrix <- function(node, depth = 0) {
    if (depth > 6) return(NULL)
    if (is_numeric_table(node)) return(as_num_matrix(node))
    if (is.list(node)) {
      for (child in node) {
        result <- find_matrix(child, depth + 1)
        if (!is.null(result)) return(result)
      }
    }
    NULL
  }
  result <- find_matrix(emb)
  if (is.null(result)) {
    message("Could not find a numeric matrix. Printing structure for diagnosis:")
    str(emb, max.level = 4)
    stop("extract_emb_matrix() failed — see structure printed above.")
  }
  result
}
Code
# Confirm reticulate is using the r-transformers virtualenv
reticulate::py_config()

# Confirm transformers is importable and print its version
reticulate::py_run_string(
  "import transformers; print('transformers', transformers.__version__)"
)

# Quick embedding test
test_emb <- text::textEmbed(
  "Hello BERT.",
  model = "distilbert-base-uncased"
)
cat("Embedding dimensions:", ncol(extract_emb_matrix(test_emb)), "\n")

Expected output: the py_config() path should point to your r-transformers virtualenv; transformers version should be 4.x or 5.x; embedding dimensions should be 768.

First-Run Model Downloads

On the first call with a new model name, weights are downloaded from the HuggingFace Hub and cached in ~/.cache/huggingface/. DistilBERT is ~260 MB; BERT-base ~440 MB; RoBERTa-base ~480 MB. Subsequent calls use the cache and are fast. For offline use, copy the cache folder to the target machine.

Exercises: Setup

Q3. A colleague copies your script to a new machine, runs it, and gets Error: Python module 'transformers' not found. She has already run Step 2 successfully. What is the most likely cause?






Q4. You need to use the same script on both Windows and Mac. The virtualenv Python path is different on each OS. How would you write the Sys.setenv() call to handle both automatically?






Feature Extraction and Sentence Embeddings

Interface: text | Model: BERT-base / sentence-transformers

Section Overview

What you will learn: What sentence embeddings are and why they are useful; how to extract them with text::textEmbed() using BERT; how to handle varying output structures across text package versions; how to use sentence-transformers models for better semantic similarity; and how to compute cosine similarity

What Are Sentence Embeddings?

A sentence embedding is a fixed-length numeric vector representing the meaning of a sentence. Semantically similar sentences should produce vectors that are close together in embedding space (high cosine similarity); unrelated sentences should be far apart. Sentence embeddings are the foundation for semantic search, document clustering, duplicate detection, and as features for downstream statistical models.

Sample Sentences

Code
sentences <- c(
  "The researchers found strong evidence for the hypothesis.",
  "Scientists discovered compelling support for their theory.",
  "The cat sat on the mat.",
  "A feline rested upon the rug.",
  "Quantum entanglement defies classical intuition."
)

Robust Embedding Extraction

The text package’s output structure has changed across versions. The extract_emb_matrix() helper function defined in the Setup section (Step 5) handles this by recursively searching the nested list returned by textEmbed() for the first numeric matrix, regardless of how the installed version of text has structured its output. If the function throws an error, run str(embeddings_bert, max.level = 4) to inspect the raw structure and report the issue.

Diagnosing Extraction Problems

If extract_emb_matrix() throws an error, inspect the raw structure first:

str(embeddings_bert, max.level = 4)

This shows exactly where the numeric data lives in the output object.


Extracting BERT Embeddings

Code
embeddings_bert <- text::textEmbed(
  texts       = sentences,
  model       = "bert-base-uncased",
  layers      = -2,        # second-to-last layer (negative index from top)
  aggregation_from_tokens_to_texts = "mean"
)

emb_matrix_bert <- extract_emb_matrix(embeddings_bert)
dim(emb_matrix_bert)   # should be [5 × 768]

Layer Selection

BERT has 12 transformer layers, each encoding different linguistic information:

  • Lower layers (1–4) — syntax, morphology, surface patterns
  • Middle layers (5–8) — word sense, lexical semantics
  • Upper layers (9–12) — task-specific representations tuned to pre-training

For sentence similarity, the second-to-last layer (layers = -2) or an average of the last four layers (layers = c(-1, -2, -3, -4)) typically performs best. In the text package, layers are specified as negative integers counting from the top: -1 is the final layer, -2 is second-to-last, and so on.

Code
# Average of last 4 layers — often marginally better than a single layer
embeddings_bert_4l <- text::textEmbed(
  texts  = sentences,
  model  = "bert-base-uncased",
  layers = c(-1, -2, -3, -4)
)

Using a sentence-transformers Model

Vanilla BERT embeddings are not calibrated for cosine similarity — the model was never trained to produce geometrically comparable sentence vectors. Sentence-transformers models (Reimers and Gurevych 2019) are fine-tuned on sentence pairs with a contrastive objective and are strongly recommended for any similarity task:

Code
embeddings_st <- text::textEmbed(
  texts = sentences,
  model = "sentence-transformers/all-MiniLM-L6-v2"
)

emb_matrix <- extract_emb_matrix(embeddings_st)
dim(emb_matrix)   # should be [5 × 384] for all-MiniLM-L6-v2

all-MiniLM-L6-v2 is fast (22M parameters, ~80 MB), small, and high-quality for most sentence similarity tasks.

Computing Cosine Similarity

Code
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# emb_matrix is now a guaranteed plain numeric matrix
n   <- nrow(emb_matrix)
sim <- matrix(0, n, n)
for (i in seq_len(n)) for (j in seq_len(n)) {
  sim[i, j] <- cosine_sim(emb_matrix[i, ], emb_matrix[j, ])
}
rownames(sim) <- colnames(sim) <- paste0("S", seq_len(n))
round(sim, 3)

Expected output: S1 (researchers/hypothesis) and S2 (scientists/theory) should have high similarity (~0.85+); S3/S4 (cat/mat, feline/rug) should be similar to each other but dissimilar to S1/S2; S5 (quantum) should be dissimilar to all others.

Exercises: Embeddings

Q5. A researcher embeds 500 customer reviews using vanilla BERT-base and finds cosine similarity between synonymous reviews is only 0.52. A colleague recommends switching to sentence-transformers/all-MiniLM-L6-v2. Why would this help?






Q6. You need to find the 10 most semantically similar abstracts to a query from a corpus of 100,000 academic abstracts. Why is brute-force pairwise cosine similarity impractical at this scale?






Text Classification

Interface: reticulate | Model: RoBERTa-base

Section Overview

What you will learn: How BERT/RoBERTa classification works via the [CLS] token; how to run a sentiment classification pipeline using RoBERTa via reticulate; how to apply zero-shot multi-label classification; and how to evaluate classification output

Why RoBERTa for Classification?

RoBERTa consistently outperforms BERT-base on classification benchmarks and is the recommended starting point when fine-tuning a classifier. We use reticulate here because it gives full access to the HuggingFace pipeline API, including returning all class scores and precise pipeline configuration.

How Classification Works

A small classification head — a linear layer with softmax — is placed on top of the [CLS] embedding:

Input text → RoBERTa encoder → [CLS] embedding (768-dim)
    → linear layer (768 → n_classes) → softmax → class probabilities

During fine-tuning, both the encoder weights and the classification head are updated.
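The head itself is nothing more than a matrix multiplication plus softmax. A minimal R sketch with a mock [CLS] vector and random, untrained weights (the two class names are hypothetical):

```r
# Mock [CLS] embedding and an untrained 2-class linear head.
set.seed(7)
cls_vec <- rnorm(768)                                     # [CLS] embedding from the encoder
W       <- matrix(rnorm(768 * 2, sd = 0.02), nrow = 768)  # linear layer: 768 -> 2 classes
b       <- c(0, 0)

logits <- as.vector(cls_vec %*% W) + b
probs  <- exp(logits) / sum(exp(logits))                  # softmax
names(probs) <- c("negative", "positive")
probs                                                     # sums to 1
```

Fine-tuning amounts to adjusting `W`, `b`, and the encoder weights so that these probabilities match the labelled training data.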

Sample Data

Code
reviews <- tibble::tibble(
  text = c(
    "An absolute masterpiece — deeply moving and beautifully filmed.",
    "Tedious, predictable, and about forty minutes too long.",
    "The performances are extraordinary, especially the lead actress.",
    "I cannot believe I sat through the entire thing. Dreadful.",
    "A refreshingly original story with genuine emotional depth.",
    "The script is a mess and the direction is incomprehensible.",
    "Funny, warm, and surprisingly touching in its final act.",
    "Utterly forgettable. I had forgotten it before leaving the cinema.",
    "A technical triumph — the cinematography alone is worth the ticket.",
    "Poor dialogue, worse acting. A waste of everyone's time."
  ),
  true_label = c("positive","negative","positive","negative",
                 "positive","negative","positive","negative",
                 "positive","negative")
)

Sentiment Classification with RoBERTa

Code
transformers <- reticulate::import("transformers")

# A RoBERTa model fine-tuned on ~124M tweets
roberta_clf <- transformers$pipeline(
  "sentiment-analysis",
  model = "cardiffnlp/twitter-roberta-base-sentiment-latest",
  return_all_scores = TRUE
)

roberta_results <- purrr::map_dfr(reviews$text, function(txt) {
  scores <- roberta_clf(txt)[[1]]
  best   <- scores[[which.max(sapply(scores, `[[`, "score"))]]
  tibble::tibble(
    pred_label = best$label,
    pred_score = round(best$score, 4)
  )
})

reviews_roberta <- dplyr::bind_cols(reviews, roberta_results)
reviews_roberta

text                                               true_label  pred_label  pred_score
An absolute masterpiece — deeply moving and beau   positive    positive    0.9981
Tedious, predictable, and about forty minutes to   negative    negative    0.9963
The performances are extraordinary, especially t   positive    positive    0.9974
I cannot believe I sat through the entire thing.   negative    negative    0.9971
A refreshingly original story with genuine emoti   positive    positive    0.9988
The script is a mess and the direction is incomp   negative    negative    0.9957
Funny, warm, and surprisingly touching in its fi   positive    positive    0.9979
Utterly forgettable. I had forgotten it before l   negative    negative    0.9984
A technical triumph — the cinematography alone i   positive    positive    0.9976
Poor dialogue, worse acting. A waste of everyone   negative    negative    0.9968

Zero-Shot Classification

Zero-shot classification uses a large natural language inference (NLI) model to classify text into arbitrary categories without task-specific fine-tuning:

Code
zs_clf <- transformers$pipeline(
  "zero-shot-classification",
  model = "facebook/bart-large-mnli"   # BART fine-tuned on MultiNLI
)

headlines <- c(
  "Central bank raises interest rates for the third consecutive quarter",
  "New study links ultra-processed foods to increased dementia risk",
  "Midfielder signs record-breaking transfer deal with European club",
  "Parliament votes to tighten regulations on artificial intelligence"
)

candidate_labels <- c("economics","health","sports","politics","technology")

zs_results <- purrr::map_dfr(headlines, function(h) {
  res <- zs_clf(h, candidate_labels)
  tibble::tibble(
    text      = h,
    top_label = res$labels[[1]],
    top_score = round(res$scores[[1]], 4)
  )
})

zs_results

Exercises: Text Classification

Q7. A researcher applies a RoBERTa sentiment model fine-tuned on Twitter data to classify academic paper abstracts and finds poor performance. What is the most likely cause?






Q8. What is the key advantage of zero-shot classification over a fine-tuned classifier, and what is its main limitation?






Named Entity Recognition

Interface: reticulate | Model: BERT-NER

Section Overview

What you will learn: What NER is and the standard entity types; how BERT performs token-level NER using BIO tagging; how to run a NER pipeline via reticulate; how to process a corpus and aggregate entity counts; and how to visualise entity distributions

What Is Named Entity Recognition?

NER identifies and classifies named entities — real-world objects referred to by name — in text:

Tag Entity type Example
PER Person Angela Merkel, Shakespeare
ORG Organisation UNESCO, Apple Inc.
LOC Location Brisbane, the Amazon
GPE Geopolitical entity Australia, the EU
DATE Temporal expression last Tuesday, the 1980s
MONEY Monetary amount $4.5 billion

Different tagging schemes cover different subsets of these types: models trained on CoNLL-2003, including the one used below, output only PER, ORG, LOC, and MISC, while finer-grained types such as GPE, DATE, and MONEY come from schemes like OntoNotes.

How BERT Does NER

BERT performs NER as token classification using BIO tagging. Each token receives:

  • B-TYPE — beginning of an entity span of type TYPE
  • I-TYPE — inside a span (continuation of a B- token)
  • O — outside any entity

Token:    Angela    Merkel    visited   Berlin   last   Tuesday
BIO tag:  B-PER     I-PER     O         B-LOC    O      O

A linear classification head on each token’s BERT embedding predicts BIO tags from contextualised representations.
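Collapsing BIO tags into entity spans (the merging that HuggingFace's aggregation strategies perform internally) can be sketched in plain R. This is a toy span collapser written for illustration, not the pipeline's actual code:

```r
# Collapse BIO tags into entity spans: B- starts a span, I- extends it.
bio_to_spans <- function(tokens, tags) {
  spans <- list()
  current_words <- character(0)
  current_type  <- NULL
  flush <- function() {
    if (!is.null(current_type)) {
      spans[[length(spans) + 1]] <<- list(
        entity = paste(current_words, collapse = " "),
        type   = current_type
      )
    }
  }
  for (k in seq_along(tokens)) {
    tag <- tags[k]
    if (startsWith(tag, "B-")) {
      flush()
      current_words <- tokens[k]
      current_type  <- sub("^B-", "", tag)
    } else if (startsWith(tag, "I-") && !is.null(current_type)) {
      current_words <- c(current_words, tokens[k])
    } else {
      flush()
      current_words <- character(0)
      current_type  <- NULL
    }
  }
  flush()
  spans
}

tokens    <- c("Angela", "Merkel", "visited", "Berlin", "last", "Tuesday")
tags      <- c("B-PER", "I-PER", "O", "B-LOC", "O", "O")
spans_out <- bio_to_spans(tokens, tags)
spans_out   # two spans: "Angela Merkel" (PER) and "Berlin" (LOC)
```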

NER Pipeline via reticulate

We use dslim/bert-base-NER, a BERT model fine-tuned on CoNLL-2003:

Code
transformers <- reticulate::import("transformers")

ner_pipeline <- transformers$pipeline(
  "ner",
  model                = "dslim/bert-base-NER",
  aggregation_strategy = "simple"   # merge B-/I- tokens into full spans
)

news_text <- paste(
  "The European Central Bank, headquartered in Frankfurt, announced that",
  "Christine Lagarde would attend the G20 summit in New Delhi.",
  "The United States Federal Reserve and the Bank of England issued",
  "a joint statement on inflation targets."
)

raw_ents <- ner_pipeline(news_text)

# Convert Python list of dicts to an R data frame
entities_df <- purrr::map_dfr(raw_ents, function(e) {
  tibble::tibble(
    entity_group = e$entity_group,
    word         = e$word,
    score        = round(e$score, 4),
    start        = e$start,
    end          = e$end
  )
})

entities_df

entity_group  word                   score
ORG           European Central Bank  0.9991
LOC           Frankfurt              0.9987
PER           Christine Lagarde      0.9994
LOC           New Delhi              0.9983
LOC           United States          0.9976
ORG           Federal Reserve        0.9968
ORG           Bank of England        0.9989
ORG           G20                    0.9871

Corpus-Scale NER

Code
corpus <- tibble::tibble(
  doc_id = paste0("doc", 1:5),
  text = c(
    "Angela Merkel met Emmanuel Macron in Paris to discuss NATO strategy.",
    "Tesla reported record profits at its Palo Alto headquarters.",
    "The WHO and UNICEF launched a joint initiative in sub-Saharan Africa.",
    "Rishi Sunak addressed Parliament in London on the NHS funding crisis.",
    "Amazon opened a new fulfilment centre near Manchester last Monday."
  )
)

corpus_ner <- purrr::pmap_dfr(corpus, function(doc_id, text) {
  ents <- ner_pipeline(text)
  purrr::map_dfr(ents, function(e) {
    tibble::tibble(
      doc_id       = doc_id,
      entity_group = e$entity_group,
      word         = e$word,
      score        = round(e$score, 4)
    )
  })
})

corpus_ner

Visualising Entity Distributions

Code
corpus_ner |>
  dplyr::count(entity_group, sort = TRUE) |>
  ggplot2::ggplot(ggplot2::aes(
    x = reorder(entity_group, n), y = n, fill = entity_group
  )) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::coord_flip() +
  ggplot2::theme_bw() +
  ggplot2::labs(
    title = "Entity type distribution across corpus",
    x = "Entity type", y = "Count"
  )
Exercises: Named Entity Recognition

Q9. You run the NER pipeline on “Apple released the new iPhone in Cupertino, California” and receive two separate entries — “Cupertino” (B-LOC) and “California” (B-LOC) — instead of a merged span. You used aggregation_strategy = "simple". Why?






Q10. A researcher applies a CoNLL-2003 trained BERT NER model to 19th-century parliamentary debates and finds entities frequently missed or misclassified. What is the most likely cause and the best remedy?






Question Answering

Interface: text | Model: DistilBERT (SQuAD)

Section Overview

What you will learn: How BERT performs extractive QA by predicting answer spans; how to run QA using text::textQA(); how to handle unanswerable questions with a score threshold; and how to apply QA to a corpus for structured information extraction

Extractive vs Generative QA

BERT-based QA is extractive: given a question and a context passage, the model identifies the contiguous span within the passage that best answers the question. It does not generate new text.

This is distinct from generative QA (GPT, Claude), which synthesises answers from parametric knowledge. For corpus linguistics and information extraction, extractive QA has a key advantage: every answer is traceable to a specific position in the source text, making results auditable and reproducible.
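This traceability can be checked mechanically: HuggingFace QA pipelines return character offsets into the context, so every answer can be re-extracted from the source passage. A minimal sketch with invented offsets (0-based start, end pointing one past the last character):

```r
# Re-extract an answer span from its source passage
context <- "The University of Queensland was founded in 1909 in Brisbane."
ans     <- list(answer = "1909", start = 44L, end = 48L)

# substr() is 1-based, hence start + 1
substr(context, ans$start + 1, ans$end) == ans$answer   # TRUE
```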

How BERT Does QA

The question and context are concatenated with separator tokens:

[CLS] Who wrote Alice's Adventures in Wonderland? [SEP] Alice's Adventures ... [SEP]

The model produces two probability distributions over all tokens — one for the start and one for the end of the answer span. The answer is the span [start, end] maximising P(start) × P(end). If the [CLS] token scores highest, the model signals the passage does not contain the answer.
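The decoding step can be mimicked with toy numbers. The probabilities below are invented; the sketch only shows the argmax over valid spans (start ≤ end), not the model itself:

```r
# Hypothetical per-token probabilities for a 5-token passage
p_start <- c(0.02, 0.05, 0.80, 0.08, 0.05)
p_end   <- c(0.01, 0.04, 0.10, 0.75, 0.10)

# Search all valid spans [s, e] with s <= e for max P(start) * P(end)
best <- c(score = -Inf, start = NA, end = NA)
for (s in seq_along(p_start)) {
  for (e in s:length(p_end)) {
    sc <- p_start[s] * p_end[e]
    if (sc > best["score"]) best <- c(score = sc, start = s, end = e)
  }
}

best
# best span: start = 3, end = 4 (score 0.80 * 0.75 = 0.60)
```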

Running QA with text

text::textQA() wraps the HuggingFace QA pipeline and returns a tidy data frame with answer, score, start, and end — no Python object handling required:

Code
context <- paste(
  "The University of Queensland was founded in 1909 in Brisbane, Australia.",
  "It is a member of the Group of Eight, a coalition of leading Australian",
  "research universities. The main campus is located at St Lucia, on the",
  "banks of the Brisbane River. UQ has produced numerous Nobel laureates",
  "and is consistently ranked among the top 50 universities in the world."
)

questions <- c(
  "When was the University of Queensland founded?",
  "Where is the main campus located?",
  "What coalition is UQ a member of?",
  "How many students does UQ enrol each year?"   # not answerable from context
)

qa_results <- purrr::map_dfr(questions, function(q) {
  text::textQA(
    question = q,
    context  = context,
    model    = "distilbert-base-cased-distilled-squad"
  )
}) |>
  dplyr::mutate(question = questions)

qa_results |> dplyr::select(question, answer, score)

question                                          answer                score
When was the University of Queensland founded?    1909                  0.9973
Where is the main campus located?                 St Lucia              0.9887
What coalition is UQ a member of?                 Group of Eight        0.9812
How many students does UQ enrol each year?        top 50 universities   0.0241

Handling Unanswerable Questions

Code
ANSWER_THRESHOLD <- 0.10

qa_results_filtered <- qa_results |>
  dplyr::mutate(
    answer_reliable = score >= ANSWER_THRESHOLD,
    answer_display  = dplyr::if_else(
      answer_reliable, answer, "[not found in passage]"
    )
  )

qa_results_filtered |> dplyr::select(question, answer_display, score)
QA Models Always Select a Span

BERT QA models never refuse to answer — they always return the most probable span, even when the context does not contain the answer. Score thresholds are a necessary but imperfect safeguard. Always manually review low-confidence answers in high-stakes information extraction.

Corpus-Scale Information Extraction

Code
universities <- tibble::tibble(
  name = c("Oxford","Cambridge","Harvard","MIT"),
  text = c(
    "The University of Oxford is the oldest university in the English-speaking world, with teaching dating back to 1096.",
    "The University of Cambridge was founded in 1209 by scholars leaving Oxford after a dispute.",
    "Harvard University, established in 1636, is the oldest institution of higher learning in the United States.",
    "The Massachusetts Institute of Technology was founded in 1861 in response to industrialisation."
  )
)

founding_dates <- purrr::pmap_dfr(universities, function(name, text) {
  res <- text::textQA(
    question = "When was this university founded?",
    context  = text,
    model    = "distilbert-base-cased-distilled-squad"
  )
  dplyr::bind_cols(tibble::tibble(university = name), res)
})

founding_dates |> dplyr::select(university, answer, score)
Exercises: Question Answering

Q11. You apply extractive QA to ask “What did the government announce?” across 500 news articles and find many results are too short — the answer is “measures” when the full context is “new economic stimulus measures”. What causes this?






Q12. What is the fundamental difference between extractive and generative QA, and why is extractive QA preferred for systematic corpus analysis?






Fine-Tuning RoBERTa

Interface: reticulate | Model: RoBERTa-base

Section Overview

What you will learn: When fine-tuning is appropriate; how to prepare a labelled dataset; how to fine-tune RoBERTa-base using the HuggingFace Trainer API via reticulate; how to evaluate; and how to save and reload the model

When to Fine-Tune

Fine-tuning is appropriate when:

  • Pre-trained models underperform due to domain mismatch
  • Zero-shot classification does not achieve adequate accuracy and you have labelled data
  • You need calibrated probabilities for a specific stable classification scheme
  • You have at least ~100–200 labelled examples per class

Always try a pre-trained pipeline or zero-shot approach first.

GPU Recommended

Fine-tuning RoBERTa-base for 3 epochs takes ~10–20 minutes on a modern GPU and several hours on CPU only. Google Colab (free tier) or a university HPC cluster are practical alternatives.

Training Data

Code
train_data <- tibble::tibble(
  text = c(
    "Previous studies have investigated the relationship between vocabulary size and reading comprehension.",
    "The role of prosody in second language acquisition has received considerable attention.",
    "Corpus-based approaches to grammar description emerged in the 1980s.",
    "Early work on discourse coherence focused primarily on written texts.",
    "We collected data from 45 native speakers of Australian English aged 18 to 34.",
    "Transcripts were coded using a modified version of the DT annotation scheme.",
    "A random forest classifier was trained on TF-IDF features extracted from the corpus.",
    "Participants completed a 30-minute semi-structured interview.",
    "The analysis revealed a significant positive correlation between frequency and acceptability ratings.",
    "Results showed hedging devices were more common in written than spoken academic discourse.",
    "The classifier achieved an F1 score of 0.87 on the held-out test set.",
    "Three distinct intonation patterns were identified in the target construction.",
    "These findings suggest that frequency effects operate at the level of the construction.",
    "The results support the hypothesis that discourse coherence is sensitive to genre.",
    "We conclude that BERT-based approaches offer a viable alternative to rule-based parsers.",
    "This study provides evidence for the usage-based account of grammaticalization."
  ),
  label = c(0L,0L,0L,0L, 1L,1L,1L,1L, 2L,2L,2L,2L, 3L,3L,3L,3L)
)

id2label <- c("0"="background","1"="method","2"="result","3"="conclusion")
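The Trainer call below uses all 16 sentences for training. In real applications you would first hold out a validation set, stratified by class so each label is represented in both partitions. A minimal sketch (toy data standing in for train_data; names invented):

```r
# Toy stand-in for train_data: 8 texts, 2 classes
toy <- data.frame(
  text  = paste("example", 1:8),
  label = rep(0:1, each = 4)
)

set.seed(42)

# Sample 75% of each class for training (stratified split)
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$label), function(idx) {
  sample(idx, size = ceiling(0.75 * length(idx)))
}))

train_split <- toy[train_idx, ]
valid_split <- toy[-train_idx, ]

table(train_split$label)   # 3 examples per class
```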

Fine-Tuning with HuggingFace Trainer

Code
transformers <- reticulate::import("transformers")
datasets_py  <- reticulate::import("datasets")

model_name <- "roberta-base"   # change to "bert-base-uncased" to compare

# ── 1. Tokeniser ───────────────────────────────────────────────────────────
tokenizer <- transformers$AutoTokenizer$from_pretrained(model_name)

# ── 2. HuggingFace Dataset ─────────────────────────────────────────────────
py_train <- datasets_py$Dataset$from_dict(list(
  text  = as.list(train_data$text),
  label = as.list(as.integer(train_data$label))
))

py_train_tok <- py_train$map(
  reticulate::py_func(function(batch) {
    tokenizer(batch[["text"]],
              padding    = TRUE,
              truncation = TRUE,
              max_length = 128L)
  }),
  batched = TRUE
)

# ── 3. Model ───────────────────────────────────────────────────────────────
model <- transformers$AutoModelForSequenceClassification$from_pretrained(
  model_name,
  num_labels = 4L
)

# ── 4. Training arguments ──────────────────────────────────────────────────
training_args <- transformers$TrainingArguments(
  output_dir              = "tutorials/bert/models/rhetorical-roberta",
  num_train_epochs        = 3L,
  per_device_train_batch_size = 8L,
  learning_rate           = 2e-5,
  weight_decay            = 0.01,
  logging_steps           = 10L,
  save_strategy           = "epoch"
)

# ── 5. Train ───────────────────────────────────────────────────────────────
trainer <- transformers$Trainer(
  model         = model,
  args          = training_args,
  train_dataset = py_train_tok
)

trainer$train()
Switching to BERT-base

To compare RoBERTa against BERT-base on the same task, change one line:

model_name <- "bert-base-uncased"   # instead of "roberta-base"

All other code is identical — HuggingFace handles both models through the same AutoModel API.

Evaluating and Saving

Code
test_texts  <- c(
  "Participants were recruited via an online platform and completed the task remotely.",
  "The frequency effect was larger for low-proficiency learners.",
  "Future work should examine whether these patterns hold in spontaneous speech."
)
true_labels <- c("method","result","conclusion")

classifier <- transformers$pipeline(
  "text-classification",
  model     = "tutorials/bert/models/rhetorical-roberta",
  tokenizer = model_name
)

preds <- purrr::map_dfr(test_texts, function(t) {
  res <- classifier(t)[[1]]
  tibble::tibble(
    text       = t,
    pred_label = id2label[stringr::str_extract(res$label, "\\d")],
    score      = round(res$score, 4)
  )
}) |>
  dplyr::mutate(true_label = true_labels,
                correct    = pred_label == true_label)

preds

# Save for reuse
model$save_pretrained("tutorials/bert/models/rhetorical-roberta")
tokenizer$save_pretrained("tutorials/bert/models/rhetorical-roberta")
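With more than a handful of test items, per-row correctness is usually summarised as a confusion matrix plus overall accuracy. A sketch with toy predictions (preds_toy is invented, standing in for preds above):

```r
preds_toy <- data.frame(
  true_label = c("method", "result", "conclusion", "result"),
  pred_label = c("method", "result", "result",     "result")
)

# Rows: gold labels; columns: model predictions
table(true = preds_toy$true_label, predicted = preds_toy$pred_label)

# Overall accuracy
mean(preds_toy$true_label == preds_toy$pred_label)   # 0.75
```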
Exercises: Fine-Tuning

Q13. Training accuracy is 0.98 but validation accuracy is 0.61 after 3 epochs of fine-tuning RoBERTa on 600 labelled tweets. What is happening and what should you do?






Q14. A colleague argues BERT-base should always be used for fine-tuning instead of DistilBERT because it has more parameters. Under what circumstances is DistilBERT the better choice?






BERT vs RoBERTa: Model Comparison

Section Overview

What you will learn: A systematic comparison of DistilBERT, BERT-base, and RoBERTa-base; their trade-offs across speed, size, and performance; a side-by-side classification run; and a decision framework for choosing among them

Side-by-Side on the Same Task

We run all three models on the same sentiment classification task:

Code
transformers <- reticulate::import("transformers")

models_to_compare <- list(
  list(name = "DistilBERT",
       model = "distilbert-base-uncased-finetuned-sst-2-english"),
  list(name = "BERT-base",
       model = "textattack/bert-base-uncased-SST-2"),
  list(name = "RoBERTa-base",
       model = "textattack/roberta-base-SST-2")
)

comparison_results <- purrr::map_dfr(models_to_compare, function(m) {
  clf <- transformers$pipeline("sentiment-analysis", model = m$model)
  purrr::map_dfr(reviews$text, function(txt) {
    t_start <- proc.time()[["elapsed"]]
    result  <- clf(txt)[[1]]
    elapsed <- proc.time()[["elapsed"]] - t_start
    tibble::tibble(
      model      = m$name,
      text       = txt,
      pred_label = tolower(result$label),
      score      = round(result$score, 4),
      time_ms    = round(elapsed * 1000, 1)
    )
  })
})

Pre-Computed Comparison Results

model          accuracy   mean_score   mean_time_ms
DistilBERT     1          0.9987       42.3
BERT-base      1          0.9991       78.1
RoBERTa-base   1          0.9993       83.6
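A summary like this can be derived from comparison_results with a grouped summary, once gold labels are joined in. The sketch below uses a toy stand-in with a true_label column (all values invented), since accuracy requires gold labels:

```r
# Toy stand-in for comparison_results, with gold labels attached
toy_results <- data.frame(
  model      = rep(c("DistilBERT", "BERT-base"), each = 2),
  pred_label = c("positive", "negative", "positive", "positive"),
  true_label = c("positive", "negative", "positive", "negative"),
  score      = c(0.99, 0.98, 0.97, 0.60),
  time_ms    = c(40, 44, 78, 80)
)

comparison_summary <- toy_results |>
  dplyr::group_by(model) |>
  dplyr::summarise(
    accuracy     = mean(pred_label == true_label),
    mean_score   = round(mean(score), 4),
    mean_time_ms = round(mean(time_ms), 1)
  )

comparison_summary
```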

Comprehensive Model Comparison Table

Property                       DistilBERT                   BERT-base                          RoBERTa-base
Parameters                     66M                          110M                               125M
Layers                         6                            12                                 12
Hidden size                    768                          768                                768
Tokenisation                   WordPiece                    WordPiece                          BPE (byte-level)
Pre-training data              Same as BERT (~16 GB)        BooksCorpus + Wikipedia (~16 GB)   CC-News + OpenWebText + more (~160 GB)
Pre-training objectives        MLM (distilled from BERT)    MLM + NSP                          MLM only (dynamic masking, no NSP)
GLUE score (avg, indicative)   ~77                          ~79                                ~86
Inference speed (relative)     Fastest (1×)                 Medium (1.9×)                      Slowest (2×)
Model size on disk             ~260 MB                      ~440 MB                            ~480 MB
Best for                       Speed-constrained inference; small hardware; prototyping   |   General baseline; NER; QA; widest model ecosystem   |   Classification; fine-tuning; tasks requiring highest accuracy

When to Use Which Model

Choose DistilBERT when:

  • Speed is critical (real-time pipelines, large-scale batch inference on CPU)
  • Hardware is constrained (no GPU, limited RAM, edge deployment)
  • You are prototyping and want fast iteration
  • The task is straightforward and performance requirements are modest

Choose BERT-base when:

  • You need a well-studied, widely-cited baseline
  • You are doing NER or QA (most pre-trained NER/QA models on HuggingFace Hub are BERT-based)
  • You want compatibility with the widest range of existing fine-tuned models
  • Inference speed is moderate and accuracy matters

Choose RoBERTa-base when:

  • You are fine-tuning for classification and need the best accuracy
  • You have sufficient training data and a GPU
  • You are building a production classification system
  • The task involves informal or web text (BPE tokenisation handles this better)

Trade-Off Visualisation

Code
tibble::tibble(
  model        = c("DistilBERT","BERT-base","RoBERTa-base"),
  glue_score   = c(77, 79, 86),
  speed_factor = c(1.0, 1.9, 2.0),
  size_mb      = c(260, 440, 480)
) |>
  ggplot2::ggplot(ggplot2::aes(
    x = speed_factor, y = glue_score,
    label = model, size = size_mb, colour = model
  )) +
  ggplot2::geom_point(show.legend = FALSE) +
  ggplot2::geom_text(vjust = -1.2, size = 4, show.legend = FALSE) +
  ggplot2::scale_size_continuous(range = c(5, 12)) +
  ggplot2::theme_bw() +
  ggplot2::labs(
    title    = "BERT-family trade-offs: accuracy vs inference speed",
    subtitle = "Point size proportional to model size on disk",
    x        = "Relative inference time (1× = DistilBERT speed)",
    y        = "Average GLUE score (indicative)"
  )
Exercises: Model Comparison

Q15. A sociolinguist wants to identify dialect features in 50,000 social media posts with no labelled data and no GPU. Which model and approach would you recommend?






Q16. A team fine-tunes DistilBERT, BERT-base, and RoBERTa-base on the same task and gets F1 of 0.81, 0.83, and 0.89 respectively. They must process 10,000 texts per hour on a single CPU server. Which model should they deploy?






Summary and Further Reading

This tutorial has provided a comprehensive introduction to BERT and RoBERTa in R, covering architectural foundations, R’s Python dependency, the two recommended interfaces, five hands-on NLP tasks, and a systematic model comparison.

Section 1 established the conceptual foundations: static embedding limitations, bidirectional self-attention, WordPiece tokenisation (tokens ≠ words), [CLS]/[SEP] special tokens, BERT’s MLM and NSP pre-training, and how RoBERTa improves on BERT through more data, dynamic masking, removal of NSP, and BPE tokenisation.

Section 2 answered the “R-only?” question directly: pure-R transformer inference is not currently possible. All R transformer packages use Python via reticulate. The text package provides the most R-native experience but Python always runs in the background. The now-unmaintained RBERT package (requiring TensorFlow ≤ 1.13.1) should not be used for new projects.

Section 3 introduced the two recommended interfaces — text (high-level, R-native, ideal for embeddings, QA, and downstream modelling) and reticulate (full Python API access, required for fine-tuning, NER, and advanced pipelines) — with a comparison table and task-assignment guide.

Section 4 covered setup: creating a Python virtualenv with reticulate::virtualenv_create(), installing packages with pip (avoiding the Miniforge/GitHub download that is blocked on many university networks), and the critical Sys.setenv(RETICULATE_PYTHON = ...) pattern that must precede all library() calls.

Sections 5–9 demonstrated five tasks, each with the most appropriate interface and model: embeddings (text + BERT-base), classification (reticulate + RoBERTa), NER (reticulate + BERT-NER), question answering (text + DistilBERT/SQuAD), and fine-tuning (reticulate + RoBERTa-base).

Section 10 provided a systematic comparison of DistilBERT, BERT-base, and RoBERTa-base with a comprehensive property table and a decision framework for choosing among them.

Further reading: The original BERT paper (Devlin et al. 2019) and the RoBERTa paper (Liu et al. 2019) are the essential primary sources. Reimers and Gurevych (2019) introduces sentence-transformers. Rogers, Kovaleva, and Rumshisky (2020) surveys what BERT learns. Tunstall, Von Werra, and Wolf (2022) is the definitive practical guide to HuggingFace Transformers. For R-specific coverage, the text package documentation (Kjell, Giorgi, and Schwartz 2023) and r-text.org are the primary resources.


Citation & Session Info

Schweinberger, Martin. 2026. BERT and RoBERTa in R: Transformer-Based NLP. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/bert/bert.html (Version 2026.05.01).

@manual{schweinberger2026bert,
  author       = {Schweinberger, Martin},
  title        = {BERT and RoBERTa in R: Transformer-Based NLP},
  note         = {tutorials/bert/bert.html},
  year         = {2026},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address      = {Brisbane},
  edition      = {2026.05.01}
}
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to draft and structure the entire tutorial, including R code, conceptual explanations, and exercises. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] purrr_1.2.1       tidyr_1.3.2       stringr_1.6.0     tibble_3.3.1     
 [5] flextable_0.9.11  ggplot2_4.0.2     dplyr_1.2.0       text_1.8.1       
 [9] reticulate_1.45.0 checkdown_0.0.13 

loaded via a namespace (and not attached):
  [1] rlang_1.1.7             magrittr_2.0.3          furrr_0.3.1            
  [4] compiler_4.4.2          png_0.1-8               quanteda_4.2.0         
  [7] systemfonts_1.3.1       vctrs_0.7.1             lhs_1.2.1              
 [10] tune_2.0.1              pkgconfig_2.0.3         fastmap_1.2.0          
 [13] rmarkdown_2.30          tzdb_0.4.0              prodlim_2024.06.25     
 [16] markdown_2.0            ragg_1.5.1              xfun_0.56              
 [19] litedown_0.9            jsonlite_2.0.0          recipes_1.1.1          
 [22] uuid_1.2-1              tweenr_2.0.3            RcppProgress_0.4.2     
 [25] parallel_4.4.2          stopwords_2.3           R6_2.6.1               
 [28] rsample_1.3.2           stringi_1.8.4           RColorBrewer_1.1-3     
 [31] parallelly_1.42.0       rpart_4.1.23            lubridate_1.9.4        
 [34] dials_1.4.2             Rcpp_1.1.1              knitr_1.51             
 [37] future.apply_1.11.3     readr_2.1.5             Matrix_1.7-2           
 [40] splines_4.4.2           nnet_7.3-19             timechange_0.3.0       
 [43] tidyselect_1.2.1        rstudioapi_0.17.1       yaml_2.3.10            
 [46] timeDate_4041.110       codetools_0.2-20        ggwordcloud_0.6.2      
 [49] listenv_0.9.1           lattice_0.22-6          withr_3.0.2            
 [52] S7_0.2.1                askpass_1.2.1           evaluate_1.0.5         
 [55] future_1.34.0           survival_3.7-0          polyclip_1.10-7        
 [58] zip_2.3.2               xml2_1.3.6              topics_0.70            
 [61] pillar_1.10.1           renv_1.1.7              generics_0.1.3         
 [64] hms_1.1.3               commonmark_2.0.0        scales_1.4.0           
 [67] globals_0.16.3          class_7.3-22            glue_1.8.0             
 [70] gdtools_0.5.0           tools_4.4.2             data.table_1.17.0      
 [73] gower_1.0.2             fastmatch_1.1-6         cowplot_1.2.0          
 [76] grid_4.4.2              yardstick_1.3.2         ipred_0.9-15           
 [79] colorspace_2.1-1        patchwork_1.3.0         textmineR_3.0.5        
 [82] ggforce_0.4.2           DiceDesign_1.10         cli_3.6.5              
 [85] textshaping_1.0.0       workflows_1.3.0         officer_0.7.3          
 [88] parsnip_1.4.1           fontBitstreamVera_0.1.1 lava_1.8.1             
 [91] gtable_0.3.6            GPfit_1.0-9             digest_0.6.39          
 [94] fontquiver_0.2.1        htmlwidgets_1.6.4       farver_2.1.2           
 [97] htmltools_0.5.9         lifecycle_1.0.5         hardhat_1.4.2          
[100] fontLiberation_0.1.0    gridtext_0.1.6          openssl_2.3.2          
[103] MASS_7.3-61            


References

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86.
Kjell, Oscar, Salvatore Giorgi, and H. Andrew Schwartz. 2023. “The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers.” Psychological Methods 28 (6): 1478.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv Preprint arXiv:1907.11692.
Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–92.
Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2020. “A Primer in BERTology: What We Know about How BERT Works.” Transactions of the Association for Computational Linguistics 8: 842–66.
Tunstall, Lewis, Leandro von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O’Reilly Media.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.