Word Embeddings and Vector Semantics

Author

Martin Schweinberger

Published

January 1, 2026

Introduction

This tutorial introduces word embeddings — dense vector representations of words that capture semantic meaning — and demonstrates how to train, explore, visualise, and apply them in R. Word embeddings are one of the most influential ideas in modern computational linguistics and natural language processing. They underpin everything from search engines and machine translation to sentiment analysis and the large language models that have become central to research workflows.

The tutorial covers the theoretical foundations of vector semantics, the mechanics of the word2vec algorithm, hands-on training of embedding models from raw text, finding semantically similar words, performing word analogies, visualising embedding spaces with t-SNE, using pre-trained embeddings such as GloVe and fastText, and applying embeddings to research questions in linguistics including semantic change, bias detection, and metaphor analysis.

Learning Objectives

By the end of this tutorial you will be able to:

Explain the distributional hypothesis and its role as the theoretical foundation of word embeddings
Distinguish between count-based, prediction-based, and contextualised embedding approaches
Describe how the word2vec skip-gram algorithm learns word vectors from text
Train a word2vec model in R using the word2vec package
Find semantically similar words using cosine similarity
Perform word analogies via vector arithmetic and interpret the results
Visualise high-dimensional embeddings in 2D using t-SNE
Load and query pre-trained GloVe embeddings
Apply embeddings to linguistic research questions including semantic change and bias detection
Choose between training custom embeddings and using pre-trained models for a given task

Prerequisite Tutorials

Before working through this tutorial, you should be comfortable with the content of:

A basic understanding of what vectors and matrices are is helpful, but no advanced linear algebra is required.

Citation

Martin Schweinberger. 2026. Word Embeddings and Vector Semantics. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/embeddings_tutorial/embeddings_tutorial.html (Version 2026.03.27).

Part 1: The Theory of Word Embeddings

Section Overview

What you will learn: The distributional hypothesis; why traditional word representations fail; the geometry of semantic space; and the three main families of word embeddings

The distributional hypothesis

The theoretical foundation of word embeddings is a deceptively simple idea known as the distributional hypothesis, articulated most memorably by the British linguist J. R. Firth:

“You shall know a word by the company it keeps” (Firth 1957, 11)

The core claim is that words appearing in similar contexts have similar meanings. Consider three sentences: “The cat sat on the mat,” “The dog sat on the mat,” and “The car drove down the street.” Because cat and dog share context words (sat, mat) while car does not, the distributional hypothesis predicts that cat and dog should be semantically closer to each other than either is to car. Word embeddings operationalise this prediction by assigning each word a position in a multidimensional space such that distributional similarity translates into geometric proximity.
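The prediction for these three sentences can be checked by hand. The following base R sketch is only an illustration of the idea, not part of any embedding algorithm: it treats every other word in a sentence as context and measures how much of that context two target words share. The helper names `context_of()` and `context_overlap()` are invented for this example.

```r
# Toy corpus: the three sentences from the text, lowercased, punctuation removed
sentences <- c(
  "the cat sat on the mat",
  "the dog sat on the mat",
  "the car drove down the street"
)

# Collect every other word from the sentences that contain the target word
context_of <- function(target, sentences) {
  tokens <- strsplit(sentences, " ", fixed = TRUE)
  hits   <- Filter(function(tok) target %in% tok, tokens)
  setdiff(unique(unlist(hits)), target)
}

# Share of context words that two targets have in common (Jaccard overlap)
context_overlap <- function(a, b, sentences) {
  ctx_a <- context_of(a, sentences)
  ctx_b <- context_of(b, sentences)
  length(intersect(ctx_a, ctx_b)) / length(union(ctx_a, ctx_b))
}

context_overlap("cat", "dog", sentences)  # identical contexts
context_overlap("cat", "car", sentences)  # only "the" is shared
```

Real embedding models replace this crude set overlap with weighted counts or learned vectors, but the underlying signal is the same: cat and dog share their contexts, cat and car barely do.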

This is a linguistically rich idea with deep roots. The distributional hypothesis connects to structuralist notions of paradigmatic and syntagmatic relations, to corpus-based approaches to lexical meaning, and to usage-based theories that treat meaning as emerging from patterns of use in context (Jurafsky and Martin 2024, ch. 6). The fact that meaning can be approximated from co-occurrence statistics alone — without any hand-crafted semantic knowledge — has proven to be one of the most powerful ideas in computational linguistics.

The problem with traditional representations

Before embeddings became widespread, words were typically represented as one-hot vectors: binary vectors with a 1 in the position corresponding to that word and 0 everywhere else.

cat  = [1, 0, 0, 0, 0, ..., 0]   (10,000 dimensions)
dog  = [0, 1, 0, 0, 0, ..., 0]
car  = [0, 0, 1, 0, 0, ..., 0]

This representation has several fatal problems for semantic tasks. First, there is no way to measure similarity: cat and dog are as different from each other (cosine similarity = 0) as cat and car. Second, the representations are extremely sparse — almost entirely zeros — which is computationally wasteful. Third, the dimensionality grows with vocabulary size, reaching tens or hundreds of thousands of dimensions for any realistic corpus.

Word embeddings solve all of these problems:

cat  = [0.2, -0.4,  0.7, ...,  0.1]   (50–300 dimensions)
dog  = [0.3, -0.5,  0.8, ...,  0.2]   (similar to cat)
car  = [-0.1, 0.6, -0.3, ...,  0.4]   (different from cat/dog)

The representations are dense (most values non-zero), low-dimensional (typically 50–300), and semantically structured: similar words cluster together and the geometric relationships between vectors encode meaningful semantic relationships.

Vector space geometry

In an embedding space, similarity is measured by cosine similarity — the cosine of the angle between two vectors:

\[\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}\]

Cosine similarity ranges from –1 (opposite directions) to +1 (identical direction). For word embeddings, values between 0.3 and 0.9 are typical for semantically related words.
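As a minimal sketch, the formula translates directly into a few lines of base R. The function name `cosine_sim()` and the short vectors below are illustrative assumptions: the dense vectors reuse the first few values shown earlier and are made up, not output from a trained model.

```r
# Cosine similarity between two numeric vectors of equal length
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# One-hot vectors: every pair of distinct words is orthogonal
cat_onehot <- c(1, 0, 0)
dog_onehot <- c(0, 1, 0)
cosine_sim(cat_onehot, dog_onehot)   # 0

# Illustrative 4-dimensional dense vectors (made-up values, not model output)
cat_dense <- c(0.2, -0.4, 0.7, 0.1)
dog_dense <- c(0.3, -0.5, 0.8, 0.2)
car_dense <- c(-0.1, 0.6, -0.3, 0.4)

cosine_sim(cat_dense, dog_dense)   # close to +1
cosine_sim(cat_dense, car_dense)   # negative
```

Because cosine similarity divides by both vector lengths, it depends only on direction: multiplying a vector by a positive constant leaves its similarity to every other vector unchanged.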

One of the most striking properties of well-trained embeddings is that vector arithmetic preserves semantic relationships. The famous example (Mikolov et al. 2013):

\[\text{embedding}(\textit{king}) - \text{embedding}(\textit{man}) + \text{embedding}(\textit{woman}) \approx \text{embedding}(\textit{queen})\]

Geometrically, this says that the direction from man to king (representing something like “royalty” or “leadership relative to gender”) is roughly the same as the direction from woman to queen. This parallelism reflects the genuine regularity of the relationship in the training data.
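The geometry is easiest to see in a deliberately artificial example. The sketch below assumes a hand-built two-dimensional space in which one axis stands for "royalty" and the other for "gender"; the values are invented for illustration, and learned embeddings are higher-dimensional and much noisier.

```r
# Hand-built 2-D vectors: column 1 = "royalty", column 2 = "gender"
# (illustrative values chosen by hand, not learned from data)
toy <- rbind(
  man   = c(0.1,  1.0),
  woman = c(0.1, -1.0),
  king  = c(0.9,  1.0),
  queen = c(0.9, -1.0),
  apple = c(-0.8, 0.1)
)

# The two offsets are identical: both move along the "royalty" axis only
toy["king", ] - toy["man", ]
toy["queen", ] - toy["woman", ]

# king - man + woman, then rank the remaining words by cosine similarity
target     <- toy["king", ] - toy["man", ] + toy["woman", ]
candidates <- toy[!rownames(toy) %in% c("king", "man", "woman"), , drop = FALSE]
sims <- apply(candidates, 1, function(v) {
  sum(v * target) / (sqrt(sum(v^2)) * sqrt(sum(target^2)))
})
names(which.max(sims))
```

In this constructed space the answer is exact by design. In a trained model the two offsets are only approximately equal, which is why analogy results are reported as ranked candidates rather than a single answer.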

Types of word embeddings

Three broad approaches to learning word embeddings have been developed (Almeida and Xexéo 2019):

Count-based methods (classical) — build a co-occurrence matrix counting how often each pair of words appears within a context window, then apply dimensionality reduction (typically Singular Value Decomposition) to obtain dense vectors. Examples include Latent Semantic Analysis (LSA) and Hyperspace Analogue to Language (HAL). These methods are transparent and interpretable, but less effective than prediction-based methods for most tasks.

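Prediction-based methods (modern) — train a neural network to predict context words from a target word (skip-gram) or a target word from its context words (CBOW). The embedding vectors are the learned weights of this network. Examples include word2vec (Mikolov et al. 2013) and fastText (Bojanowski et al. 2017). GloVe (Pennington, Socher, and Manning 2014) is usually grouped with these methods, although it is fitted to global co-occurrence counts rather than trained by predicting context words, so it is better seen as a hybrid of the count-based and prediction-based families. These are the most widely used methods in linguistic research.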

Contextualised embeddings (current frontier) — produce different vectors for the same word depending on its context, capturing polysemy and disambiguation. Examples include ELMo, BERT, and GPT. The same word bank gets a different vector in “river bank” than in “savings bank.” These models are state-of-the-art for most NLP tasks but require substantially more computational resources.
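The count-based recipe in the first family above (co-occurrence counts followed by SVD) is small enough to sketch in base R. This is a toy illustration under simplifying assumptions: a three-sentence corpus, a symmetric window of 2, raw counts without any weighting, and only two retained dimensions. Methods such as LSA apply additional weighting steps that are omitted here.

```r
sentences <- c(
  "the cat sat on the mat",
  "the dog sat on the mat",
  "the car drove down the street"
)
tokens <- strsplit(sentences, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Symmetric word-by-word co-occurrence counts within a +/- 2 word window
window <- 2
cooc <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (tok in tokens) {
  for (i in seq_along(tok)) {
    neighbours <- setdiff(max(1, i - window):min(length(tok), i + window), i)
    for (j in neighbours) {
      cooc[tok[i], tok[j]] <- cooc[tok[i], tok[j]] + 1
    }
  }
}

# Truncated SVD: keep the first k left singular vectors, scaled by singular values
k      <- 2
decomp <- svd(cooc)
dense  <- decomp$u[, 1:k] %*% diag(decomp$d[1:k])
rownames(dense) <- vocab

round(dense[c("cat", "dog", "car"), ], 2)
```

Because cat and dog have identical co-occurrence rows in this corpus, they receive identical dense vectors, while car ends up elsewhere. The signs of SVD dimensions are arbitrary, so only relative positions are meaningful.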

This tutorial focuses on word2vec and GloVe, which remain the methods of choice for most linguistic research applications.

The word2vec algorithm

Word2vec (Mikolov et al. 2013) introduced two architectures for learning embeddings:

Skip-gram — given a target word, predict the surrounding context words. The network learns to associate each target word with the types of contexts it appears in.

CBOW (Continuous Bag of Words) — given the surrounding context words, predict the target word. Generally faster to train and works better for frequent words; skip-gram works better for rare words and smaller datasets.

The skip-gram training process works as follows. For the sentence “The quick brown fox jumps,” with target word brown and a window of 2:

Training pairs generated: (brown, the), (brown, quick), (brown, fox), (brown, jumps)

The neural network learns to maximise the probability of predicting these context words from the target word, adjusting the embedding vectors at each step to reduce prediction error. After training on many such pairs across the corpus, words that frequently share contexts end up with similar vectors.
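The pair generation step can be written out explicitly. The helper `skipgram_pairs()` below is a hypothetical name introduced for this sketch; it only enumerates the (target, context) pairs for a given window and does not perform any training.

```r
# Generate (target, context) pairs for every token in a sentence
skipgram_pairs <- function(tokens, window = 2) {
  pairs <- lapply(seq_along(tokens), function(i) {
    # Positions within the window, excluding the target position itself
    idx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
    data.frame(
      target  = rep(tokens[i], length(idx)),
      context = tokens[idx],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pairs)
}

tokens <- c("the", "quick", "brown", "fox", "jumps")
pairs  <- skipgram_pairs(tokens, window = 2)

# Pairs for the target word "brown": the, quick, fox, jumps
pairs[pairs$target == "brown", ]
```

Words near the edge of a sentence produce fewer pairs because the window is truncated there, which is one reason sentence segmentation affects what the model learns.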

Key hyperparameters are summarised below:

Parameter	What it controls	Typical values
`dim`	Embedding dimensions	50–300
`window`	Context window size	5–10
`min_count`	Minimum word frequency	5–10
`type`	Architecture	`"skip-gram"` or `"cbow"`
`iter`	Training iterations	10–50

Hyperparameter trade-offs

More dimensions capture more nuance but risk overfitting and slow training. A larger window captures broader topical relationships; a smaller window captures tighter syntactic relationships. More iterations improve learning but with diminishing returns. A good starting point is dim = 100, window = 5, iter = 20, then experiment based on results.

Exercises: Theory

Q1. The word2vec model is trained on a corpus where the word “awful” appears frequently in contexts alongside words like “terrible,” “dreadful,” and “horrible.” The word “awesome” appears frequently alongside “amazing,” “incredible,” and “brilliant.” Based on the distributional hypothesis, what would you predict about the cosine similarity between the embeddings of “awful” and “awesome”? What does this tell us about the relationship between distributional similarity and semantic similarity?

The cosine similarity will be very high (close to 1.0), because both words are adjectives that evaluate things
The cosine similarity between 'awful' and 'awesome' will likely be low-to-moderate, because their context words are very different — 'awful' co-occurs with negative evaluative adjectives, 'awesome' with positive ones. This illustrates an important limitation of the distributional hypothesis: it captures distributional similarity (similar contexts), which often tracks semantic similarity (synonyms, related concepts), but can miss antonyms. Antonyms like 'awful' and 'awesome' often appear in quite different contexts despite being from the same semantic domain (evaluation). Count-based embeddings tend to place antonyms further apart, though prediction-based models vary.
The cosine similarity will be exactly 0, because the two words have opposite meanings
It is impossible to predict the similarity without running the model

Part 2: Setup and Data

Section Overview

What you will learn: Which R packages are needed; how to install them; and how to load and preprocess text for embedding training

Installing packages

Code

install.packages("word2vec")      # word2vec training
install.packages("text2vec")      # alternative, faster for large corpora
install.packages("textdata")      # download pre-trained GloVe
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("purrr")
install.packages("ggplot2")
install.packages("ggrepel")
install.packages("Rtsne")         # t-SNE dimensionality reduction
install.packages("here")
install.packages("flextable")
install.packages("checkdown")

Loading packages

Code

library(word2vec)
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(ggplot2)
library(ggrepel)
library(Rtsne)
library(here)
library(flextable)

Package overview

word2vec — trains word2vec models; the easiest entry point for beginners
text2vec — faster and more memory-efficient for large corpora
textdata — provides convenient access to pre-trained GloVe embeddings
Rtsne — implements t-SNE for 2D visualisation of high-dimensional embedding spaces

Loading example data

We use three public-domain literary texts — Alice’s Adventures in Wonderland, Moby Dick, and Pride and Prejudice — as our training corpus. These texts provide sufficient vocabulary diversity for demonstration purposes, though for serious research applications you would want a domain-specific corpus of at least 10 million words.

Code

alice <- readLines(here::here("tutorials/embeddings/data", "alice.txt"))
moby  <- readLines(here::here("tutorials/embeddings/data", "moby.txt"))
pride <- readLines(here::here("tutorials/embeddings/data", "pride.txt"))

# Combine into a single corpus string
corpus <- paste(c(alice, moby, pride), collapse = " ")

# Preprocessing: lowercase and normalise whitespace
corpus_clean <- corpus |>
  tolower() |>
  str_replace_all("\\s+", " ") |>
  str_trim()

cat("Corpus size:", str_count(corpus_clean, "\\S+"), "words\n")

Corpus size: 362385 words

Code

cat("First 200 characters:\n")

First 200 characters:

Code

cat(substr(corpus_clean, 1, 200), "...\n")

*** start of the project gutenberg ebook 11 *** [illustration] alice’s adventures in wonderland by lewis carroll the millennium fulcrum edition 3.0 contents chapter i. down the rabbit-hole chapter ii. ...

Preprocessing decisions

For this tutorial we apply minimal preprocessing: lowercasing and whitespace normalisation. Your own research may require different choices:

Keep case if named entities or proper nouns are important to your research questions
Keep punctuation if syntactic context is relevant (some embeddings benefit from sentence boundary information)
Remove or tag numbers depending on whether numeric values carry semantic content in your domain
Handle contractions consistently (e.g., expand don’t to do not, or keep as is)

There is no universal answer — the right preprocessing depends on your research question and corpus.
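To make these choices concrete, the sketch below applies three alternative preprocessing options to one short sample string. The sample text and the object names are assumptions for illustration only; the regular expression handles just double quotation marks and is not a complete punctuation policy.

```r
# Short sample string with curly quotes and a contraction (illustration only)
raw <- "\u201cGet to your places!\u201d shouted the Queen. Alice didn\u2019t move."

# Option 1 (as in this tutorial): lowercase and collapse whitespace only
minimal <- trimws(gsub("\\s+", " ", tolower(raw)))

# Option 2: additionally strip straight and curly double quotation marks,
# so that a quote-prefixed token and its plain form count as the same word
no_quotes <- gsub("[\"\u201C\u201D]", "", minimal)

# Option 3: keep the original case, e.g. when proper nouns matter
cased <- trimws(gsub("\\s+", " ", raw))

strsplit(minimal,   " ", fixed = TRUE)[[1]][1]   # first token keeps its opening quote
strsplit(no_quotes, " ", fixed = TRUE)[[1]][1]   # first token is the bare word
```

Under minimal preprocessing, quotation marks stay attached to the neighbouring word, so a model trained on such text can end up with separate vocabulary entries for the quoted and unquoted form of the same word.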

Part 3: Training a Word2Vec Model

Section Overview

What you will learn: How to format text for word2vec; how to train a model; and how to extract and inspect the embedding matrix

Formatting text for word2vec

word2vec requires a vector of sentences, not a single string

The word2vec() function processes text sentence by sentence. It must receive a character vector where each element is a sentence (or document). A single long string will cause a training error.

Code

# Split corpus into sentences
corpus_sentences <- corpus_clean |>
  str_split("\\.\\s+") |>
  unlist() |>
  discard(~ nchar(.x) == 0)

cat("Number of sentences:", length(corpus_sentences), "\n")

Number of sentences: 12980

Code

cat("Example sentence:", corpus_sentences[[5]], "\n")

Example sentence: the rabbit sends in a little bill chapter v

Training the model

Code

set.seed(42)

model <- word2vec(
  x         = corpus_sentences,
  type      = "skip-gram",
  dim       = 100,
  window    = 5,
  iter      = 20,
  min_count = 5,
  threads   = 2
)

# Inspect: show first 50 words in vocabulary
summary(model)[1:50]

 [1] "abundantly"   "acceptance"   "accompany"    "accounting"   "ache"        
 [6] "adhering"     "afar"         "aged"         "alacrity"     "alien"       
[11] "altar"        "amazed"       "amber"        "amends"       "anatomical"  
[16] "anchored"     "andes"        "animation"    "answers"      "antarctic"   
[21] "antique"      "apologize"    "apple"        "apply"        "arrested"    
[26] "arrow"        "artificial"   "ascribed"     "assertion"    "associations"
[31] "assuming"     "atmosphere"   "attacked"     "attempts"     "attractions" 
[36] "attribute"    "attributed"   "augment"      "authorities"  "axe"         
[41] "bade"         "banished"     "barb"         "barely"       "bats"        
[46] "beard"        "befell"       "behalf"       "bingleys"     "bitterly"

What happened during training:

The text was tokenised into words
Random initial vectors were assigned to all vocabulary words
For each target word, the model predicted context words within a window of 5
Prediction errors were used to update the embedding vectors
This process was repeated 20 times (iterations) across the entire corpus
Words that frequently shared contexts ended up with similar vectors
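The update step in this list can be illustrated with a single target-context pair. The sketch below assumes a negative-sampling style logistic objective, one of the training objectives described in the word2vec literature; it is a simplified illustration with invented helper names, not the code that the word2vec package runs internally.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# One stochastic gradient step for a single (target, context) pair.
# label = 1 for an observed pair, 0 for a randomly sampled "negative" pair.
sgns_step <- function(v_target, u_context, label, lr = 0.1) {
  err <- sigmoid(sum(v_target * u_context)) - label   # prediction error
  list(
    v_target  = v_target  - lr * err * u_context,
    u_context = u_context - lr * err * v_target
  )
}

set.seed(1)
v_brown <- rnorm(5, sd = 0.1)   # small random target vector for "brown"
u_fox   <- rnorm(5, sd = 0.1)   # small random context vector for "fox"

before <- sigmoid(sum(v_brown * u_fox))
for (step in 1:50) {
  updated <- sgns_step(v_brown, u_fox, label = 1)
  v_brown <- updated$v_target
  u_fox   <- updated$u_context
}
after <- sigmoid(sum(v_brown * u_fox))

c(before = before, after = after)
```

Each update for an observed pair pulls the two vectors towards each other, so the predicted probability of the pair rises. Updates with `label = 0` push randomly sampled pairs apart, which keeps all vectors from collapsing onto the same point.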

Extracting the embedding matrix

Code

embedding_matrix <- as.matrix(model)

cat("Embedding matrix:", nrow(embedding_matrix), "words x",
    ncol(embedding_matrix), "dimensions\n")

Embedding matrix: 5876 words x 100 dimensions

Code

# Look at a specific word's embedding
word_example <- "alice"
if (word_example %in% rownames(embedding_matrix)) {
  cat("\nFirst 10 dimensions of embedding for '", word_example, "':\n", sep = "")
  cat(round(embedding_matrix[word_example, 1:10], 4), "...\n")
}


First 10 dimensions of embedding for 'alice':
0.0558 1.1902 -1.1603 -1.8534 0.4619 -0.3369 0.3537 0.3286 -0.3606 -0.1591 ...

Each row of the embedding matrix represents one word as a 100-dimensional vector. The individual dimension values have no direct linguistic interpretation — what matters is the pattern of values across dimensions, and specifically how similar two patterns are to each other.

Part 4: Semantic Similarity

Section Overview

What you will learn: How to find the nearest neighbours of a word in embedding space; how to interpret cosine similarity scores; and how similarity in embedding space relates to (and differs from) lexical semantic similarity

Finding similar words

Code

# Find the 10 words most similar to "queen"
similar_to_queen <- predict(
  model,
  newdata = "queen",
  type    = "nearest",
  top_n   = 10
)

similar_to_queen |>
  as.data.frame() |>
  flextable() |>
  set_table_properties(width = 0.6, layout = "autofit") |>
  theme_zebra() |>
  set_caption("Top 10 words most similar to 'queen'") |>
  border_outer()

queen.term1	queen.term2	queen.similarity	queen.rank
queen	king	0.7626287	1
queen	croquet	0.7154154	2
queen	knave	0.7149584	3
queen	executioner	0.7108862	4
queen	hurriedly	0.7010316	5
queen	maria	0.6949580	6
queen	“get	0.6918876	7
queen	alice	0.6874384	8
queen	“mrs	0.6810983	9
queen	shrill	0.6805178	10
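A nearest-neighbour query of this kind can be approximated by hand, assuming that cosine similarity is the ranking criterion. The sketch below uses a small made-up embedding matrix and a hypothetical helper `nearest_words()`; with a trained model, the matrix returned by `as.matrix(model)` could be passed in instead.

```r
# Toy embedding matrix: rows are words, columns are dimensions (made-up values)
toy_embeddings <- rbind(
  queen   = c(0.9, 0.8, 0.1),
  king    = c(0.8, 0.9, 0.2),
  knave   = c(0.6, 0.5, 0.4),
  whale   = c(-0.7, 0.1, 0.9),
  harpoon = c(-0.8, 0.0, 0.8)
)

nearest_words <- function(embeddings, word, top_n = 3) {
  # Scale each row to unit length so that dot products equal cosine similarities
  unit <- embeddings / sqrt(rowSums(embeddings^2))
  sims <- drop(unit %*% unit[word, ])
  # Drop the query word itself and rank the rest
  sims <- sort(sims[names(sims) != word], decreasing = TRUE)
  head(data.frame(term = names(sims), similarity = unname(sims)), top_n)
}

nearest_words(toy_embeddings, "queen")
```

Normalising the rows once and using a single matrix product avoids looping over the vocabulary, which matters when the matrix has thousands of rows.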

Visualising similarity scores

Code

similar_ship <- predict(model, newdata = "ship", type = "nearest", top_n = 12)

similar_ship |>
  as.data.frame() |>
  head(10) |>
  ggplot(aes(x = reorder(ship.term2, ship.similarity), y = ship.similarity)) +
  geom_col(fill = "gray50") +
  coord_flip() +
  labs(
    title = "Words most similar to 'ship'",
    x     = "Word",
    y     = "Cosine similarity"
  ) +
  theme_bw()

Exploring multiple words

Code

test_words <- c("love", "king", "ocean", "thought")

for (word in test_words) {
  if (word %in% rownames(embedding_matrix)) {
    similar <- predict(model, newdata = word, type = "nearest", top_n = 5)
    cat("\nMost similar to '", word, "':\n", sep = "")
    print(as.data.frame(similar)[, 2])
  }
}


Most similar to 'love':
[1] "marry"     "felicity"  "entertain" "girl"      "esteemed" 

Most similar to 'king':
[1] "queen"       "executioner" "hatter"      "“call"       "majesty"    

Most similar to 'ocean':
[1] "japanese"   "seas"       "sunset"     "blending"   "oftentimes"

Most similar to 'thought':
[1] "guessed"    "“perhaps"   "behaved"    "liveliness" "sharply"

Interpreting similarity results

Embedding similarity captures several different types of relationships simultaneously — synonymy, antonymy, topical association, and grammatical similarity can all contribute. Words that are similar in the embedding space share distributional contexts; this often correlates with semantic similarity but is not identical to it.

Notably, antonyms like love and hate may appear relatively similar in embedding space because they occupy parallel syntactic positions in sentences about emotion. This is a known limitation of static word embeddings trained purely on co-occurrence: they encode distributional pattern similarity, which only approximates semantic similarity.

Exercises: Semantic similarity

Q2. A researcher trains a word2vec model on a corpus of 19th-century English novels and finds that the words most similar to “gay” are: cheerful, merry, lively, bright, animated. A colleague trains a model on a corpus of 21st-century web text and finds the most similar words to be: lesbian, bisexual, queer, transgender, pride. What does this difference illustrate, and why is it important for linguistic research using word embeddings?

It illustrates that word2vec models are unreliable because they give different results for the same word
It illustrates that 19th-century corpora are too small to train reliable embeddings
This illustrates semantic change — the word 'gay' has shifted its primary meaning from 'cheerful, happy, lively' (its dominant sense in 19th-century usage) to referring to homosexuality (its dominant sense in contemporary usage). Because word embeddings capture distributional patterns from their training corpus, they effectively create a snapshot of word meaning as it exists in that corpus at that time. This makes embeddings powerful tools for studying diachronic semantic change: by training separate models on corpora from different time periods and comparing the nearest neighbours of the same word, researchers can quantify and document meaning shifts. This approach has been applied systematically in diachronic linguistics research.
It illustrates that the two models used different hyperparameters and are not comparable

Part 5: Word Analogies

Section Overview

What you will learn: How vector arithmetic implements word analogies; how to compute analogies manually; the conditions under which analogies work well; and how to visualise the geometric parallelism that makes analogies possible

Vector arithmetic for analogies

One of the most celebrated properties of word embeddings is that vector arithmetic can solve analogy tasks (Mikolov et al. 2013). The analogy “man is to king as woman is to ?” is solved by:

\[\text{target} = \text{embedding}(\textit{king}) - \text{embedding}(\textit{man}) + \text{embedding}(\textit{woman})\]

The word whose embedding is closest to this target vector should be queen. Geometrically, the vector from man to king captures the “royalty” direction in embedding space, and applying the same displacement to woman points toward queen.

Code

# Helper function: a is to b as c is to ?
# Mathematically: result = b - a + c
word_analogy <- function(model, a, b, c, top_n = 5) {
  embeddings <- as.matrix(model)

  if (!all(c(a, b, c) %in% rownames(embeddings))) {
    missing <- c(a, b, c)[!c(a, b, c) %in% rownames(embeddings)]
    stop(paste("Words not in vocabulary:", paste(missing, collapse = ", ")))
  }

  vec_a <- embeddings[a, ]
  vec_b <- embeddings[b, ]
  vec_c <- embeddings[c, ]

  # Compute the target vector
  target_vector <- vec_b - vec_a + vec_c

  # Cosine similarity with all vocabulary words
  similarities <- apply(embeddings, 1, function(wv) {
    sum(wv * target_vector) /
      (sqrt(sum(wv^2)) * sqrt(sum(target_vector^2)))
  })

  # Remove the input words from results
  similarities <- similarities[!names(similarities) %in% c(a, b, c)]

  top_words <- sort(similarities, decreasing = TRUE)[1:top_n]

  data.frame(
    word       = names(top_words),
    similarity = as.numeric(top_words),
    row.names  = NULL
  )
}

Code

# Classic analogy: man is to king as woman is to ?
analogy_result <- word_analogy(model, a = "man", b = "king", c = "woman", top_n = 5)

analogy_result |>
  flextable() |>
  set_table_properties(width = 0.5, layout = "autofit") |>
  theme_zebra() |>
  set_caption("king - man + woman = ?") |>
  border_outer()

word	similarity
remind	0.4813380
queen	0.4488754
majesty	0.4178134
girl	0.4117409
amusement	0.4009154

More analogy examples

Code

vocab <- rownames(as.matrix(model))

test_analogy <- function(model, a, b, c) {
  vocab <- rownames(as.matrix(model))
  label <- paste(a, ":", b, "::", c, ": ?")

  if (!all(c(a, b, c) %in% vocab)) {
    missing <- c(a, b, c)[!c(a, b, c) %in% vocab]
    cat(label, "\n  Words not in vocabulary:", paste(missing, collapse = ", "), "\n\n")
    return(invisible(NULL))
  }

  result <- word_analogy(model, a, b, c, top_n = 5)
  cat(label, "\n")
  cat("  Top results:", paste(result$word[1:5], collapse = ", "), "\n")
  cat("  Similarities:", paste(round(result$similarity[1:5], 3), collapse = ", "), "\n\n")
  return(invisible(result))
}

test_analogy(model, "queen", "woman", "man")

queen : woman :: man : ? 
  Top results: young, odor, tribe, girl, disagreeable 
  Similarities: 0.525, 0.492, 0.479, 0.473, 0.459

Code

test_analogy(model, "alice", "girl", "boy")

alice : girl :: boy : ? 
  Top results: dough, spoiled, schoolmaster, healthy, lies 
  Similarities: 0.501, 0.447, 0.44, 0.439, 0.402

Code

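# Caution: this call asks walk : walking :: running : ?, which has no natural answer.
# The well-formed version is test_analogy(model, "walking", "walk", "running"),
# i.e. walking : walk :: running : ? (expected answer: run).
if (all(c("walking", "walk", "running") %in% vocab)) {
  test_analogy(model, "walk", "walking", "running")
}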

walk : walking :: running : ? 
  Top results: died, jumping, bringing, flinging, drawn 
  Similarities: 0.417, 0.415, 0.412, 0.405, 0.403

Why analogies sometimes fail

Analogy results depend heavily on the size and diversity of the training corpus. Our literary corpus of roughly 360,000 words is far smaller than what is needed for robust analogy performance; serious analogy benchmarking requires 100 million words or more (Mikolov et al. 2013). Analogies work best when the relationship is represented consistently across many examples in the training data. The gender analogy (man/king/woman/queen) works well on large corpora because gendered word pairs and pronouns are pervasive in English text; idiomatic or culture-specific relationships work much less reliably.

Visualising vector arithmetic

Code

vocab <- rownames(as.matrix(model))

if (all(c("man", "woman", "king", "queen") %in% vocab)) {
  embeddings <- as.matrix(model)
  words_of_interest <- c("man", "woman", "king", "queen")
  word_embeddings <- embeddings[words_of_interest, ]

  pca_result <- prcomp(word_embeddings, center = TRUE, scale. = FALSE)

  viz_data <- data.frame(
    word = words_of_interest,
    x    = pca_result$x[, 1],
    y    = pca_result$x[, 2]
  )

  ggplot(viz_data, aes(x = x, y = y, label = word)) +
    geom_point(size = 4, color = "steelblue") +
    geom_text_repel(size = 5, fontface = "bold") +
    geom_segment(
      data = viz_data[viz_data$word %in% c("man", "king"), ],
      aes(x = x[1], y = y[1], xend = x[2], yend = y[2]),
      arrow = arrow(length = unit(0.3, "cm")),
      color = "firebrick", linewidth = 1
    ) +
    geom_segment(
      data = viz_data[viz_data$word %in% c("woman", "queen"), ],
      aes(x = x[1], y = y[1], xend = x[2], yend = y[2]),
      arrow = arrow(length = unit(0.3, "cm")),
      color = "steelblue", linewidth = 1
    ) +
    theme_minimal() +
    labs(
      title    = "Vector arithmetic: parallel relationships",
      subtitle = "Red arrow (man to king) should parallel blue arrow (woman to queen)",
      x        = "First principal component",
      y        = "Second principal component"
    ) +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      axis.text  = element_blank(),
      panel.grid = element_blank()
    )
} else {
  cat("Words man, woman, king, queen not all in vocabulary — skipping visualisation.\n")
}

The two arrows should be roughly parallel and equal in length. This geometric parallelism is what makes vector-arithmetic analogies work: the transformation that turns man into king (in the embedding space) is approximately the same as the transformation that turns woman into queen.

Exercises: Word analogies

Q3. A researcher applies the analogy walking : walk :: running : ? to a word2vec model trained on a small corpus of 500,000 words and gets poor results (the top result is an unrelated word). She concludes that word2vec cannot capture morphological relationships. Is this conclusion justified? What are the actual reasons the analogy might fail, and what should she do instead?

The conclusion is justified — word2vec is fundamentally unable to capture morphological relationships
The conclusion is not justified. The failure of a specific analogy task on a small corpus does not mean word2vec cannot in principle capture morphological relationships — it means the conditions for reliable analogy performance were not met. Word2vec analogies require (1) a large corpus (100M+ words is ideal; 500K is far too small), (2) all three words to appear frequently enough to have well-estimated embeddings, and (3) the relationship to be consistently represented across many examples in the training data. On small corpora, rare word pairs have noisy, poorly estimated vectors, making analogy arithmetic unreliable. The researcher should either train on a much larger corpus, use pre-trained embeddings (GloVe or fastText), or use a morphologically aware model like fastText that handles subword information explicitly.
The conclusion is justified because word2vec uses whole-word representations and cannot see morphological structure
The conclusion is not justified, but the solution is simply to run more training iterations on the same corpus

Part 6: Visualising Embeddings

Section Overview

What you will learn: Why embeddings need dimensionality reduction for visualisation; how t-SNE works conceptually; how to run t-SNE in R; and how to interpret the resulting plot

The dimensionality challenge

Embedding spaces have 50–300 dimensions. Human perception is limited to 2–3 dimensions. To visualise where words sit relative to each other, we need to project the high-dimensional space down to 2D while preserving the local neighbourhood structure as faithfully as possible.

t-SNE (t-Distributed Stochastic Neighbor Embedding) is the most popular method for this purpose. It works by modelling the probability that two points are neighbours in the high-dimensional space, then finding a 2D configuration that matches those probabilities as closely as possible. Words that are close in 100D should therefore end up close in 2D; distances between words that are far apart are preserved only loosely, so the spacing between well-separated groups should not be over-interpreted.
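The idea of a neighbour probability can be sketched in a few lines of base R. This is a simplified illustration with an invented function name: it uses one fixed kernel width `sigma` for all points, whereas t-SNE itself tunes a separate width for each point so that the resulting distribution matches the chosen perplexity.

```r
# Conditional neighbour probabilities p(j | i) under a Gaussian kernel.
# Simplification: a single fixed sigma for every point.
neighbour_probs <- function(X, sigma = 1) {
  sq_dist  <- as.matrix(dist(X))^2
  affinity <- exp(-sq_dist / (2 * sigma^2))
  diag(affinity) <- 0               # a point is not its own neighbour
  affinity / rowSums(affinity)      # normalise each row to sum to 1
}

# Three made-up points: a and b are close together, c is far away
X <- rbind(a = c(0, 0), b = c(0.5, 0), c = c(3, 0))
P <- neighbour_probs(X)
round(P, 3)
```

From the point of view of a, almost all probability mass falls on b. The 2D layout is then optimised so that the corresponding low-dimensional probabilities look as similar as possible, which is why nearby points are placed carefully and distant ones only roughly.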

Preparing the data

Code

words_to_plot <- c(
  # Characters
  "alice", "queen", "king", "hatter", "rabbit",
  # Emotions
  "happy", "sad", "angry", "joy", "fear",
  # Actions
  "walk", "run", "jump", "sit", "stand",
  # Places
  "house", "garden", "forest", "city", "ocean",
  # Abstract
  "love", "hate", "hope", "dream", "thought"
)

# Keep only words in our vocabulary
words_to_plot  <- words_to_plot[words_to_plot %in% rownames(embedding_matrix)]
plot_embeddings <- embedding_matrix[words_to_plot, ]

cat("Words available for t-SNE plot:", length(words_to_plot), "\n")

Words available for t-SNE plot: 24

Running t-SNE

Code

set.seed(42)

tsne_result <- Rtsne(
  plot_embeddings,
  dims       = 2,
  perplexity = min(10, (nrow(plot_embeddings) - 1) / 3),
  theta      = 0.0,
  max_iter   = 1000
)

tsne_data <- data.frame(
  word     = words_to_plot,
  x        = tsne_result$Y[, 1],
  y        = tsne_result$Y[, 2],
  category = case_when(
    words_to_plot %in% c("alice", "queen", "king", "hatter", "rabbit") ~ "Characters",
    words_to_plot %in% c("happy", "sad", "angry", "joy", "fear")       ~ "Emotions",
    words_to_plot %in% c("walk", "run", "jump", "sit", "stand")        ~ "Actions",
    words_to_plot %in% c("house", "garden", "forest", "city", "ocean") ~ "Places",
    TRUE ~ "Abstract"
  )
)

Plotting the embedding space

Code

ggplot(tsne_data, aes(x = x, y = y, color = category, label = word)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_text_repel(size = 4, max.overlaps = 20, box.padding = 0.5) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title      = element_text(size = 16, face = "bold"),
    axis.text       = element_blank(),
    axis.ticks      = element_blank(),
    panel.grid      = element_blank()
  ) +
  labs(
    title    = "Word embeddings visualised with t-SNE",
    subtitle = "Semantically similar words cluster together",
    x        = NULL,
    y        = NULL,
    color    = "Category"
  )

Reading the plot:

Proximity indicates semantic similarity — words that are close together in the plot have similar embedding vectors, meaning they tend to appear in similar contexts
Clusters reflect shared distributional patterns — words from the same semantic category often cluster together
Absolute position is arbitrary — the specific coordinates have no meaning; only relative distances matter
t-SNE preserves local structure — words that are neighbours in the high-dimensional space will be neighbours in 2D, but distances between distant clusters are not reliably preserved
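Because t-SNE can distort distances, it is worth checking an apparent cluster against the original high-dimensional vectors before interpreting it. The following is a minimal sketch, assuming a numeric matrix with words as row names. The matrix `toy_emb` (invented values) and the helper `nearest_by_cosine()` are illustrative names and not part of the tutorial's own code.

```r
# Toy embedding matrix (invented values, for illustration only)
toy_emb <- rbind(
  happy = c(0.9, 0.1, 0.0),
  joy   = c(0.8, 0.2, 0.1),
  sad   = c(0.7, -0.3, 0.0),
  ocean = c(0.0, 0.1, 0.9)
)

# Rank all other words by cosine similarity to `word` in the original space
nearest_by_cosine <- function(mat, word, n = 3) {
  unit <- mat / sqrt(rowSums(mat^2))      # scale each row to length 1
  sims <- drop(unit %*% unit[word, ])     # dot products of unit vectors = cosines
  sims <- sims[names(sims) != word]       # drop the query word itself
  head(sort(sims, decreasing = TRUE), n)
}

nearest_by_cosine(toy_emb, "happy", n = 2)
```

With a real model, passing `embedding_matrix` instead of `toy_emb` should show whether two words that sit together in the plot are also neighbours in the full embedding space.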

t-SNE parameter notes

The perplexity parameter (roughly: how many neighbours to consider) has a large effect on the visual appearance. Too low a value overemphasises local structure; too high a value loses it. A value of 5–50 is typically appropriate, with 30 being the most common default.

The theta parameter controls the speed-accuracy trade-off: 0.0 gives exact t-SNE (slow but accurate); 0.5 gives the Barnes-Hut approximation (much faster, good enough for large datasets).

Importantly, t-SNE is stochastic — running it twice with different seeds gives different plots. Always set a seed for reproducibility.
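For small word lists the usual default of 30 is too large, and Rtsne stops with an error. A small helper can pick a valid value automatically. This is a sketch under the assumption that three times the perplexity must not exceed the number of rows minus one; `safe_perplexity()` is a hypothetical helper name, not an Rtsne function.

```r
# Largest perplexity that keeps 3 * perplexity <= n_points - 1,
# capped at a preferred target value
safe_perplexity <- function(n_points, target = 30) {
  upper <- floor((n_points - 1) / 3)
  if (upper < 1) stop("too few points for t-SNE")
  min(target, upper)
}

safe_perplexity(24)     # small word list: the cap is below the target
safe_perplexity(1000)   # large word list: the target is used
```

The result can be passed straight to the `perplexity` argument, for example `perplexity = safe_perplexity(nrow(plot_embeddings))`.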

Part 7: Pre-Trained Embeddings

Section Overview

What you will learn: When to use pre-trained embeddings rather than training your own; how to download and use GloVe; and how to choose between GloVe, fastText, and BERT for your research question

When to use pre-trained embeddings

Pre-trained embeddings offer several advantages over training your own. They are trained on massive datasets — billions of words — giving much better coverage of rare words and more reliable semantic representations. They require no training time. And they have been validated on standard benchmarks and used in published research, making results easier to contextualise and compare.

You should train your own embeddings when:

Your corpus is in a specialised domain with vocabulary not well represented in general corpora (e.g. historical texts, clinical notes, legal documents, a specific language or dialect)
Your research question concerns the distributional patterns in a particular corpus (e.g. comparing across time periods or registers)
Pre-trained embeddings for your language are not available
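One way to make the first criterion concrete is to measure how much of your corpus a pre-trained vocabulary actually covers. The sketch below uses invented tokens and a tiny stand-in vocabulary; `vocab_coverage()` is a hypothetical helper, and in practice the vocabulary would be the row names of the pre-trained embedding matrix.

```r
# Share of corpus tokens (and of distinct word types) found in a vocabulary
vocab_coverage <- function(tokens, vocab) {
  in_vocab <- tokens %in% vocab
  c(token_coverage = mean(in_vocab),
    type_coverage  = mean(unique(tokens) %in% vocab))
}

# Toy example: invented tokens and a tiny stand-in vocabulary
corpus_tokens    <- c("the", "patient", "has", "tachycardia", "the", "patient")
pretrained_vocab <- c("the", "patient", "has", "doctor")

coverage <- vocab_coverage(corpus_tokens, pretrained_vocab)
coverage
```

Low type coverage for the terms that matter to your research question is a reasonable signal that a custom or subword-based model deserves consideration.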

GloVe embeddings

GloVe (Global Vectors for Word Representation) (Pennington, Socher, and Manning 2014) is trained on aggregated global word-word co-occurrence statistics from a large corpus. Unlike word2vec, which uses a local context window, GloVe explicitly factorises the word-word co-occurrence matrix, combining the advantages of global matrix factorisation and local window-based methods. GloVe embeddings trained on Wikipedia and Gigaword (6 billion tokens, 400K vocabulary) are widely used as a general-purpose baseline.
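To make the notion of a word-word co-occurrence matrix concrete, the following minimal sketch counts co-occurrences within a symmetric window for a toy sentence. It illustrates only the counting step: GloVe itself additionally down-weights distant context words and then fits vectors to the logarithm of the counts. The function name `cooccurrence_matrix()` is an illustrative choice.

```r
# Count how often each pair of words co-occurs within `window` tokens
cooccurrence_matrix <- function(tokens, window = 2) {
  vocab  <- sort(unique(tokens))
  counts <- matrix(0, length(vocab), length(vocab),
                   dimnames = list(vocab, vocab))
  for (i in seq_along(tokens)) {
    lo <- max(1, i - window)
    hi <- min(length(tokens), i + window)
    for (j in setdiff(lo:hi, i)) {        # every neighbour except the word itself
      counts[tokens[i], tokens[j]] <- counts[tokens[i], tokens[j]] + 1
    }
  }
  counts
}

tokens <- c("the", "cat", "sat", "on", "the", "mat")
cooc   <- cooccurrence_matrix(tokens, window = 1)
cooc
```

The resulting matrix is symmetric, and for a realistic vocabulary it is extremely sparse, which is why dense low-dimensional vectors are derived from it rather than used directly.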

Code

library(textdata)

# Download 100-dimensional GloVe vectors (one-time, ~800 MB)
glove <- embedding_glove6b(dimensions = 100)

Code

# Rename columns for clarity
colnames(glove)[1] <- "word"
colnames(glove)[2:ncol(glove)] <- paste0("dim_", 1:100)

# Convert to matrix
glove_matrix <- as.matrix(glove[, -1])
rownames(glove_matrix) <- glove$word

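# Find similar words
target_word   <- "linguistics"
target_vector <- glove_matrix[target_word, ]

# Cosine similarity of every word with the target, vectorised over rows
# (much faster than apply() on a 400K-row matrix)
similarities <- drop(glove_matrix %*% target_vector) /
  (sqrt(rowSums(glove_matrix^2)) * sqrt(sum(target_vector^2)))

# The top hit is the target word itself (similarity 1)
head(sort(similarities, decreasing = TRUE), 10)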

For comprehensive background on neural network approaches to NLP, including the theory behind word embeddings, see Goldberg (2017). For an accessible introduction to transformer models and contextualised embeddings, see Tunstall, von Werra, and Wolf (2022).

Choosing between pre-trained models

Model	Trained on	Vocabulary	Dimensions	Best for
GloVe	Wikipedia + Gigaword (6B tokens)	400K	50–300	General English, linguistics research
fastText	Common Crawl (600B tokens)	2M+	300	Morphologically rich languages, rare words, misspellings
word2vec (Google News)	Google News (100B tokens)	3M	300	News domain
BERT (base)	Wikipedia + BookCorpus	~30K WordPiece subword tokens	768 (contextual)	Context-dependent tasks, polysemy

Choosing a model

GloVe is usually the best starting point for general linguistic research: it is simple to use, well-documented, and widely cited.

fastText (Bojanowski et al. 2017) handles out-of-vocabulary words by representing them as the sum of their character n-gram vectors. This makes it superior for morphologically complex languages (German, Finnish, Turkish), historical texts with spelling variation, and social media text with neologisms and misspellings.

BERT and other transformer models produce contextualised embeddings — the same word gets a different vector depending on its context. This is essential for tasks involving polysemy (word sense disambiguation), but adds substantial complexity and computational cost. See the LADAL Text Classification with BERT tutorial for an introduction.
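The character n-gram idea behind fastText can be illustrated in a few lines of base R. This sketch only shows how a word is decomposed, following the description in Bojanowski et al. (2017); it does not use the fastText library, and `char_ngrams()` is a hypothetical helper name. In fastText the vectors of these n-grams (plus a vector for the whole word, where one exists) are summed to give the word vector.

```r
# Character n-grams of a word, with "<" and ">" marking word boundaries
char_ngrams <- function(word, n_min = 3, n_max = 6) {
  padded <- paste0("<", word, ">")
  len    <- nchar(padded)
  if (len < n_min) return(padded)         # word too short to split further
  grams  <- character(0)
  for (n in n_min:min(n_max, len)) {
    starts <- seq_len(len - n + 1)
    grams  <- c(grams, substring(padded, starts, starts + n - 1))
  }
  grams
}

char_ngrams("where", n_min = 3, n_max = 3)
```

Because a misspelt or unseen word still shares most of its n-grams with known words, a vector can be assembled for it even though the word itself never occurred in the training data.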

Code

# fastText (requires fastrtext package)
library(fastrtext)
model_ft <- load_model("path/to/fasttext/model.bin")

# For BERT and other transformers, use the `text` package
library(text)
embeddings_bert <- textEmbed(
  texts = c("The bank is near the river", "I need to visit the bank"),
  model = "bert-base-uncased"
)
# "bank" gets DIFFERENT vectors in these two sentences

Part 8: Research Applications

Section Overview

What you will learn: How to apply embeddings to three substantive linguistic research questions: diachronic semantic change detection, gender bias measurement, and metaphor analysis

Semantic change detection

Tracking how word meanings shift over time is one of the most compelling applications of word embeddings in linguistics. By training separate models on corpora from different historical periods and comparing the nearest neighbours of a target word, researchers can document, date, and quantify semantic change (Hamilton, Leskovec, and Jurafsky 2016).

Hamilton, Leskovec & Jurafsky proposed two statistical laws of semantic change: the law of conformity (high-frequency words change more slowly) and the law of innovation (words with more senses change faster). Both laws were discovered by analysing diachronic embeddings at scale across multiple languages and centuries.

Code

# Conceptual example: load_corpus() is a placeholder for your own code that
# returns a character vector of sentences from a period-specific corpus
corpus_1800s <- load_corpus("1800-1850")
corpus_2000s <- load_corpus("2000-2020")

model_1800s  <- word2vec(corpus_1800s, dim = 100, iter = 20)
model_2000s  <- word2vec(corpus_2000s, dim = 100, iter = 20)

target_word <- "gay"

# Neighbours in 1800s: cheerful, merry, lively, bright, festive
neighbors_1800s <- predict(model_1800s, target_word, type = "nearest", top_n = 10)

# Neighbours in 2000s: lesbian, queer, bisexual, pride, homosexual
neighbors_2000s <- predict(model_2000s, target_word, type = "nearest", top_n = 10)

# The shift in neighbourhood shows how the meaning changed; dating the change
# more precisely requires models for the intervening periods as well
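A simple way to turn neighbourhood comparison into a number is the overlap between the two neighbour sets. The sketch below computes the Jaccard overlap; `neighbour_overlap()` is a hypothetical helper, and the neighbour lists are the illustrative ones from the comments in the conceptual example, not output from trained models.

```r
# Jaccard overlap between two sets of nearest neighbours
# (1 = identical sets, 0 = no shared neighbours)
neighbour_overlap <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Illustrative neighbour lists, not output from trained models
nn_1800s <- c("cheerful", "merry", "lively", "bright", "festive")
nn_2000s <- c("lesbian", "queer", "bisexual", "pride", "homosexual")

neighbour_overlap(nn_1800s, nn_2000s)   # disjoint sets
neighbour_overlap(nn_1800s, nn_1800s)   # identical sets
```

An advantage of this measure is that it needs no alignment of the two vector spaces, since it compares words rather than coordinates. Its drawback is that it depends on the number of neighbours chosen, so that number should be reported.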

Bias detection

Embeddings trained on large text corpora absorb the statistical regularities of human language use — including its biases. Research has shown that standard English word embeddings associate occupational terms with genders in ways that mirror historical stereotypes: doctor and engineer are closer to the male pole of the gender axis; nurse and secretary are closer to the female pole (Bolukbasi et al. 2016; Caliskan, Bryson, and Narayanan 2017; Garg et al. 2018).

Code

# Define a gender direction vector
man_vec   <- embedding_matrix["man", ]
woman_vec <- embedding_matrix["woman", ]
gender_direction <- woman_vec - man_vec

# Measure gender association of occupational terms
occupations <- c("doctor", "nurse", "engineer", "teacher", "programmer", "secretary")

occupation_bias <- sapply(occupations, function(occ) {
  if (occ %in% rownames(embedding_matrix)) {
    occ_vec <- embedding_matrix[occ, ]
    # Cosine similarity with the gender direction (a length-normalised projection)
    sum(occ_vec * gender_direction) /
      (sqrt(sum(occ_vec^2)) * sqrt(sum(gender_direction^2)))
  } else NA
})

# Positive values = more female-associated; negative = more male-associated
sort(occupation_bias)
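A direction built from a single word pair is sensitive to the idiosyncrasies of those two words. A common refinement is to combine several definitional pairs. The sketch below simply averages the difference vectors, which is a simplification of the PCA-based procedure in Bolukbasi et al. (2016). The matrix `toy_emb` contains invented values and stands in for a real embedding matrix.

```r
# Toy embedding matrix (invented values, for illustration only)
toy_emb <- rbind(
  man      = c( 1.0, 0.2, 0.1),
  woman    = c(-1.0, 0.2, 0.1),
  he       = c( 0.9, 0.0, 0.3),
  she      = c(-0.9, 0.0, 0.3),
  engineer = c( 0.4, 0.8, 0.2),
  nurse    = c(-0.5, 0.7, 0.3)
)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Average the female-minus-male difference over several word pairs
pairs <- list(c("woman", "man"), c("she", "he"))
diffs <- sapply(pairs, function(p) toy_emb[p[1], ] - toy_emb[p[2], ])
gender_direction <- rowMeans(diffs)

# Positive = closer to the female pole, negative = closer to the male pole
bias <- sapply(c("engineer", "nurse"),
               function(w) cosine(toy_emb[w, ], gender_direction))
bias
```

The signs in this toy output are built into the invented vectors; only results from a real model say anything about language use.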

Ethical considerations when using embeddings

Embeddings encode and reproduce the biases present in their training data. This has practical consequences: systems that use embeddings for hiring, lending, or content recommendation can perpetuate and amplify historical inequalities even without explicit discriminatory intent.

For researchers, the key obligations are:

Acknowledge the limitations and potential biases of any embedding model used
Do not treat embedding-based associations as objective truths about the world — they reflect patterns in text, which reflect patterns of human behaviour and historical inequalities
When using embeddings in applied contexts, consider debiasing techniques (Bolukbasi et al. 2016), though note that these techniques are themselves imperfect and contested
Use diverse, balanced training corpora where possible

Metaphor analysis

Word embeddings can quantify cross-domain semantic associations that underlie conceptual metaphors. The conceptual metaphor IDEAS ARE LIGHT predicts that words from the light/illumination domain should be semantically similar to words from the knowledge/understanding domain. We can test this by computing the similarity matrix between source and target domain words (Jurafsky and Martin 2024).

Code

source_domain <- c("light", "bright", "illuminate", "shine", "glow")
target_domain <- c("idea", "thought", "insight", "knowledge", "understanding")

metaphor_matrix <- matrix(
  0,
  nrow     = length(source_domain),
  ncol     = length(target_domain),
  dimnames = list(source_domain, target_domain)
)

for (s in source_domain) {
  for (t in target_domain) {
    if (s %in% rownames(embedding_matrix) && t %in% rownames(embedding_matrix)) {
      sv <- embedding_matrix[s, ]
      tv <- embedding_matrix[t, ]
      metaphor_matrix[s, t] <- sum(sv * tv) /
        (sqrt(sum(sv^2)) * sqrt(sum(tv^2)))
    }
  }
}

# Visualise as a heatmap
library(pheatmap)
pheatmap(metaphor_matrix,
         main            = "IDEAS ARE LIGHT metaphor: cross-domain similarities",
         display_numbers = TRUE,
         number_format   = "%.2f")
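A cross-domain similarity matrix is hard to interpret on its own, because many word pairs in an embedding space have positive cosine similarity. Comparing it against unrelated control words gives a baseline. The following sketch shows one way to do this with a vectorised helper; `cross_cosine()`, the control words, and the values in `toy_emb` are all illustrative assumptions.

```r
# Cosine similarity between every row-word and every column-word
cross_cosine <- function(mat, row_words, col_words) {
  unit <- mat / sqrt(rowSums(mat^2))            # length-normalise each vector
  unit[row_words, , drop = FALSE] %*% t(unit[col_words, , drop = FALSE])
}

# Toy embedding matrix (invented values, for illustration only)
toy_emb <- rbind(
  light   = c(0.9, 0.3, 0.1),
  bright  = c(0.8, 0.4, 0.0),
  idea    = c(0.7, 0.5, 0.2),
  insight = c(0.6, 0.6, 0.1),
  potato  = c(0.0, 0.1, 0.9),
  carpet  = c(0.1, 0.0, 0.8)
)

metaphor_sim <- cross_cosine(toy_emb, c("light", "bright"), c("idea", "insight"))
control_sim  <- cross_cosine(toy_emb, c("light", "bright"), c("potato", "carpet"))

c(metaphor = mean(metaphor_sim), control = mean(control_sim))
```

With real embeddings, support for the metaphor would require the source-target similarities to be clearly higher than the similarities between the source domain and frequency-matched control words.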

Part 9: Advanced Topics and Practical Workflow

Section Overview

What you will learn: Tips for getting better embeddings; common training errors and their fixes; how to evaluate embedding quality; an overview of document and sentence embeddings; and a practical decision framework for real research projects

Getting better embeddings

Data quality and quantity are the most important factors. For word2vec, aim for at least 10 million words; 100 million is better. Clean your data carefully:

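corpus_clean <- raw_text |>
  iconv(to = "UTF-8", sub = "") |>      # Drop bytes that cannot be converted to UTF-8
  tolower() |>
  str_replace_all("http\\S+", " ") |>   # Remove URLs
  str_replace_all("\\d+", " ") |>       # Replace digit sequences with a space
  # Remove punctuation; split into sentences BEFORE this step if you rely on
  # full stops as sentence boundaries
  str_replace_all("[^[:alnum:][:space:]]", " ") |>
  str_replace_all("\\s+", " ") |>       # Collapse repeated whitespace
  str_trim()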

Common problems and solutions

“Training failed: fileMapper” error

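This is the most common word2vec error. Cause: the text was passed as a single long string rather than a character vector of sentences. The word2vec package treats a length-one character vector as the path to a training file, so it tries to open a file with that name and fails. Fix: split the text into sentences first.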

# Wrong
model <- word2vec(paste(texts, collapse = " "))

# Correct
sentences <- texts |> paste(collapse = " ") |> str_split("\\.\\s+") |> unlist()
model <- word2vec(sentences)

Poor quality results — increase corpus size; check preprocessing did not remove too much content; try more iterations; experiment with CBOW vs skip-gram.

Out-of-vocabulary words — lower min_count; use fastText (handles subwords); use a pre-trained model with larger vocabulary.

Slow training — reduce dimensions; use smaller window size; increase number of threads; consider text2vec package which is faster for large corpora.

Evaluating embeddings

Intrinsic evaluation measures how well embeddings capture human semantic judgements:

# Word similarity datasets: WordSim-353, SimLex-999
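evaluate_similarity <- function(model, test_pairs) {
  # predict() for word2vec models offers type = "nearest" and "embedding" only,
  # so cosine similarity is computed directly from the embedding matrix
  emb    <- as.matrix(model)
  cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  model_scores <- vapply(seq_len(nrow(test_pairs)), function(i) {
    w1 <- test_pairs$word1[i]
    w2 <- test_pairs$word2[i]
    # Pairs with an out-of-vocabulary word cannot be scored
    if (!all(c(w1, w2) %in% rownames(emb))) return(NA_real_)
    cosine(emb[w1, ], emb[w2, ])
  }, numeric(1))
  # Spearman rank correlation with human ratings, ignoring unscored pairs
  cor(model_scores, test_pairs$human_score,
      method = "spearman", use = "complete.obs")
}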

Extrinsic evaluation measures performance on a downstream task (text classification, NER, sentiment analysis). This is generally more informative for applied research because it directly measures whether the embeddings are useful for your specific application.

Document embeddings

For document-level tasks (document similarity, clustering, classification), averaging word vectors is a simple and often effective approach. More principled alternatives include doc2vec (paragraph vectors) and sentence transformers.

# Simple document embedding: average of word vectors
doc_to_vector <- function(text, embedding_matrix) {
  words <- tolower(unlist(strsplit(text, "\\s+")))
  words <- words[words %in% rownames(embedding_matrix)]
  if (length(words) == 0) return(NULL)
  colMeans(embedding_matrix[words, , drop = FALSE])
}

# For higher-quality sentence embeddings, use the `text` package with a
# sentence-transformer model (`your_sentences` is a placeholder character vector)
library(text)
sentence_embeddings <- textEmbed(
  texts = your_sentences,
  model = "sentence-transformers/all-MiniLM-L6-v2"
)

Decision framework

Do you have a domain-specific corpus that differs substantially from general English?
  Yes: Should you train your own model?
    Large corpus (10M+ words)  → Train custom embeddings
    Small corpus               → Use pre-trained embeddings + fine-tuning if needed
  No: Use pre-trained embeddings
    General English tasks      → GloVe (simple, well-validated)
    Morphologically rich / rare words → fastText
    Context-dependent meaning  → BERT / sentence transformers

Is your research question diachronic (meaning change over time)?
  → Train separate models on period-specific corpora
  → Use alignment methods (Procrustes) to make spaces comparable

Is your research question about bias or social meaning?
  → Pre-trained models on large general corpora (GloVe, word2vec Google News)
  → These are better snapshots of general language use than small custom corpora
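The Procrustes alignment mentioned in the framework can be sketched in base R. Separately trained models place words in arbitrarily rotated spaces; orthogonal Procrustes finds the rotation that best maps one space onto the other. The sketch assumes two matrices with the same words in the same row order, and `procrustes_align()` is a hypothetical helper name. The toy data are random numbers, not embeddings.

```r
# Orthogonal Procrustes: find the rotation R minimising ||A %*% R - B||
# for two matrices whose rows are the same words in the same order
procrustes_align <- function(A, B) {
  s <- svd(crossprod(A, B))          # SVD of t(A) %*% B
  R <- s$u %*% t(s$v)                # optimal orthogonal map
  A %*% R
}

# Toy check: B is a random "embedding", A is the same space arbitrarily rotated
set.seed(42)
B <- matrix(rnorm(10 * 4), nrow = 10)
Q <- qr.Q(qr(matrix(rnorm(16), nrow = 4)))   # random orthogonal matrix
A <- B %*% Q

A_aligned <- procrustes_align(A, B)
max(abs(A_aligned - B))   # should be numerically close to zero
```

With real models, both matrices would first be restricted to their shared vocabulary, for example via `intersect(rownames(A), rownames(B))`. After alignment, the cosine similarity between a word's two vectors gives a direct measure of how much its usage differs between the corpora.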

Reproducibility checklist

# Document everything so results can be reproduced
analysis_metadata <- list(
  date          = Sys.Date(),
  corpus        = "alice.txt + moby.txt + pride.txt",
  corpus_words  = sum(str_count(corpus_clean, "\\S+")),  # total tokens, also if corpus_clean is a vector
  preprocessing = list(lowercase = TRUE, remove_punct = FALSE, min_count = 5),
  model_params  = list(type = "skip-gram", dim = 100, window = 5, iter = 20),
  random_seed   = 42,
  r_version     = paste(R.version$major, R.version$minor, sep = ".")
)

saveRDS(analysis_metadata, "model_metadata.rds")
set.seed(42)  # Always set before training

Exercises: Advanced topics

Q4. A researcher wants to compare word meanings across three registers: academic writing, newspaper text, and Twitter posts. She plans to train a single word2vec model on a combined corpus of all three. A colleague suggests training three separate models instead. Which approach is better, and why?

Training a single combined model is better because it has more training data and will produce more reliable embeddings
Both approaches give identical results because word2vec is insensitive to the source of training data
Training three separate models is better for this research question. A single combined model will produce average embeddings that blend the distributional patterns from all three registers, obscuring the register-specific differences the researcher wants to study. For example, 'sick' means something different in academic writing (illness) and Twitter (impressive/cool); averaging these distributions will produce a blurred representation that captures neither clearly. By training separate models and comparing the nearest neighbours of target words across the three models, the researcher can directly observe and quantify register-specific semantic differences. The challenge is that separate models have differently oriented vector spaces and cannot be directly compared — alignment methods such as Procrustes rotation are needed to map the spaces into a common coordinate system before comparison.
Pre-trained GloVe embeddings should be used instead because they were trained on a broader corpus

Quick Reference

Essential functions

Code

# Training
model <- word2vec(x = sentence_vector, type = "skip-gram",
                  dim = 100, window = 5, iter = 20, min_count = 5)

# Finding similar words
similar <- predict(model, "king", type = "nearest", top_n = 10)

# Get embedding matrix
embeddings <- as.matrix(model)

# Cosine similarity between two words
cosine_sim <- function(a, b, mat) {
  sum(mat[a,] * mat[b,]) / (sqrt(sum(mat[a,]^2)) * sqrt(sum(mat[b,]^2)))
}

# Save and load model
write.word2vec(model, "model.bin")
model <- read.word2vec("model.bin")

# t-SNE visualisation
set.seed(42)  # set the seed on its own line, not as an argument to Rtsne()
tsne_result <- Rtsne(word_subset_matrix, dims = 2, perplexity = 30,
                     theta = 0.5, max_iter = 1000)

Common workflows

# Basic similarity analysis
# (preprocess() stands for your own cleaning and sentence-splitting function)
model <- text |>
  preprocess() |>
  word2vec(dim = 100)

predict(model, "target_word", type = "nearest", top_n = 10)

# Visualisation pipeline
embeddings <- as.matrix(model)
words_subset <- embeddings[selected_words, ]
set.seed(42)
tsne_result <- Rtsne(words_subset, dims = 2, perplexity = 10)
plot_data <- data.frame(word = selected_words,
                         x = tsne_result$Y[,1], y = tsne_result$Y[,2])

# Word analogy
analogy_vector <- embeddings["king",] - embeddings["man",] + embeddings["woman",]
# Find nearest neighbour to analogy_vector in embedding space
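The final step of the analogy workflow, finding the nearest neighbour of the analogy vector, can be sketched as follows. The three input words are excluded from the candidates, as is standard practice, because one of them is often the closest vector. The matrix `toy_emb` contains invented two-dimensional values, and `solve_analogy()` is a hypothetical helper name.

```r
# Toy embedding matrix (invented values, for illustration only)
toy_emb <- rbind(
  king  = c( 1.0,  1.0),
  queen = c( 1.0, -1.0),
  man   = c( 0.0,  1.0),
  woman = c( 0.0, -1.0),
  apple = c(-1.0,  0.2)
)

# Solve "from is to `to` as `query` is to ?" and return the closest remaining word
solve_analogy <- function(mat, from, to, query) {
  target <- mat[to, ] - mat[from, ] + mat[query, ]
  sims <- drop(mat %*% target) / (sqrt(rowSums(mat^2)) * sqrt(sum(target^2)))
  sims <- sims[!names(sims) %in% c(from, to, query)]   # exclude the input words
  names(which.max(sims))
}

solve_analogy(toy_emb, from = "man", to = "king", query = "woman")
```

With a trained model, `as.matrix(model)` would take the place of `toy_emb`. On small corpora the expected answer frequently does not come out on top.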

Final Project Ideas

Capstone projects

Apply what you have learned with these research projects:

1. Historical semantic change Collect texts from different decades (e.g. from Project Gutenberg for historical periods; from newspaper archives for recent decades). Train separate embedding models. Track the nearest neighbours of target words over time. Visualise changes and identify the approximate date of semantic shifts.

2. Domain-specific terminology extraction Gather a specialised corpus (medical, legal, technical, or a specific academic field). Train custom embeddings. Use the nearest neighbours of known domain terms as seeds to extract further domain-specific vocabulary. Compare to a general English model.

3. Register comparison Compare embeddings trained on different registers (formal vs. informal, spoken vs. written, academic vs. popular). Use Procrustes alignment to make the models comparable. Identify words whose meanings are most register-specific.

4. Bias audit Load pre-trained GloVe or word2vec embeddings. Define a gender or ethnicity direction vector. Project occupational terms onto this direction. Quantify associations and compare to historical data on occupational demographics. Replicate or extend the methodology of Garg et al. (2018).

5. Metaphor mapping Identify a conceptual metaphor (e.g. ARGUMENT IS WAR, TIME IS MONEY, LIFE IS A JOURNEY). Define source and target domain vocabulary. Compute the cross-domain similarity matrix using your embeddings. Compare across languages or registers.

Suggested deliverables: A documented R script, at least two visualisations, and a brief report (1000–1500 words) interpreting and discussing your findings.

Citation & Session Info

Citation

@manual{martinschweinberger2026word,
  author       = {Martin Schweinberger},
  title        = {Word Embeddings and Vector Semantics},
  year         = {2026},
  note         = {https://ladal.edu.au/tutorials/embeddings_tutorial/embeddings_tutorial.html},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
  edition      = {2026.03.27},
  doi      = {}
}

AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially revise and expand an earlier tutorial by Martin Schweinberger. All references were verified independently before BibTeX entries were generated. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.

Code

sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] checkdown_0.0.13 flextable_0.9.11 here_1.0.2       Rtsne_0.17      
 [5] ggrepel_0.9.8    ggplot2_4.0.2    purrr_1.2.1      tidyr_1.3.2     
 [9] stringr_1.6.0    dplyr_1.2.0      text2vec_0.6.4   word2vec_0.4.1  

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.56               htmlwidgets_1.6.4      
 [4] lattice_0.22-6          vctrs_0.7.2             tools_4.4.2            
 [7] generics_0.1.4          tibble_3.3.1            pkgconfig_2.0.3        
[10] Matrix_1.7-2            data.table_1.17.0       RColorBrewer_1.1-3     
[13] S7_0.2.1                uuid_1.2-1              lifecycle_1.0.5        
[16] compiler_4.4.2          farver_2.1.2            textshaping_1.0.0      
[19] RhpcBLASctl_0.23-42     codetools_0.2-20        litedown_0.9           
[22] fontquiver_0.2.1        fontLiberation_0.1.0    htmltools_0.5.9        
[25] yaml_2.3.10             pillar_1.11.1           crayon_1.5.3           
[28] openssl_2.3.2           rsparse_0.5.3           fontBitstreamVera_0.1.1
[31] commonmark_2.0.0        tidyselect_1.2.1        zip_2.3.2              
[34] digest_0.6.39           stringi_1.8.7           labeling_0.4.3         
[37] rprojroot_2.1.1         fastmap_1.2.0           grid_4.4.2             
[40] cli_3.6.5               magrittr_2.0.4          patchwork_1.3.0        
[43] withr_3.0.2             gdtools_0.5.0           scales_1.4.0           
[46] float_0.3-2             rmarkdown_2.30          officer_0.7.3          
[49] mlapi_0.1.1             askpass_1.2.1           ragg_1.5.1             
[52] evaluate_1.0.5          knitr_1.51              markdown_2.0           
[55] rlang_1.1.7             Rcpp_1.1.1              glue_1.8.0             
[58] BiocManager_1.30.27     xml2_1.3.6              renv_1.1.7             
[61] rstudioapi_0.17.1       jsonlite_2.0.0          lgr_0.4.4              
[64] R6_2.6.1                systemfonts_1.3.1


References

Almeida, Felipe, and Geraldo Xexéo. 2019. “Word Embeddings: A Survey.” https://doi.org/10.48550/arXiv.1901.09069.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics 5: 135–46. https://doi.org/10.1162/tacl_a_00051.

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.” In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 4349–57. Barcelona, Spain. https://proceedings.neurips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html.

Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human-Like Biases.” Science 356 (6334): 183–86. https://doi.org/10.1126/science.aal4230.

Firth, John R. 1957. “A Synopsis of Linguistic Theory, 1930–1955.” In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.

Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–44. https://doi.org/10.1073/pnas.1720347115.

Goldberg, Yoav. 2017. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies 37. Morgan & Claypool Publishers. https://doi.org/10.2200/S00762ED1V01Y201703HLT037.

Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1489–1501. Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1141.

Jurafsky, Daniel, and James H. Martin. 2024. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Third Edition (draft). https://web.stanford.edu/~jurafsky/slp3/.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” In Proceedings of the 1st International Conference on Learning Representations (ICLR 2013), Workshop Track. Scottsdale, AZ. https://doi.org/10.48550/arXiv.1301.3781.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162.

Tunstall, Lewis, Leandro von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. Revised. Sebastopol, CA: O’Reilly Media.

--- title: "Word Embeddings and Vector Semantics" author: "Martin Schweinberger" date: "2026" params: title: "Word Embeddings and Vector Semantics" author: "Martin Schweinberger" year: "2026" version: "2026.03.27" url: "https://ladal.edu.au/tutorials/embeddings_tutorial/embeddings_tutorial.html" institution: "The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia" doi: "" format: html: toc: true toc-depth: 4 code-fold: show code-tools: true theme: cosmo --- ```{r setup, echo=FALSE, message=FALSE, warning=FALSE} options(stringsAsFactors = FALSE) options(scipen = 999) library(checkdown) ``` ![](/images/uq1.jpg){ width=100% } # Introduction {#intro} ![](/images/gy_chili.png){ width=15% style="float:right; padding:10px" } This tutorial introduces **word embeddings** — dense vector representations of words that capture semantic meaning — and demonstrates how to train, explore, visualise, and apply them in R. Word embeddings are one of the most influential ideas in modern computational linguistics and natural language processing. They underpin everything from search engines and machine translation to sentiment analysis and the large language models that have become central to research workflows. The tutorial covers the theoretical foundations of vector semantics, the mechanics of the word2vec algorithm, hands-on training of embedding models from raw text, finding semantically similar words, performing word analogies, visualising embedding spaces with t-SNE, using pre-trained embeddings such as GloVe and fastText, and applying embeddings to research questions in linguistics including semantic change, bias detection, and metaphor analysis. ::: {.callout-note} ## Learning Objectives By the end of this tutorial you will be able to: 1. Explain the distributional hypothesis and its role as the theoretical foundation of word embeddings 2. Distinguish between count-based, prediction-based, and contextualised embedding approaches 3. 
Describe how the word2vec skip-gram algorithm learns word vectors from text 4. Train a word2vec model in R using the `word2vec` package 5. Find semantically similar words using cosine similarity 6. Perform word analogies via vector arithmetic and interpret the results 7. Visualise high-dimensional embeddings in 2D using t-SNE 8. Load and query pre-trained GloVe embeddings 9. Apply embeddings to linguistic research questions including semantic change and bias detection 10. Choose between training custom embeddings and using pre-trained models for a given task ::: ::: {.callout-note} ## Prerequisite Tutorials Before working through this tutorial, you should be comfortable with the content of: - [Getting Started with R](/tutorials/intror/intror.html) - [String Processing in R](/tutorials/string/string.html) - [Basic Statistics in R](/tutorials/basicstatz/basicstatz.html) - [Working with Computers](/tutorials/comp/comp.html) A basic understanding of what vectors and matrices are is helpful, but no advanced linear algebra is required. ::: ::: {.callout-note} ## Citation ```{r citation-callout-top, echo=FALSE, results='asis'} cat( params$author, ". ", params$year, ". *", params$title, "*. ", params$institution, ". ", "url: ", params$url, " ", "(Version ", params$version, "), ", "doi: ", params$doi, ".", sep = "" ) ``` ::: --- # Part 1: The Theory of Word Embeddings {#part1} ::: {.callout-note} ## Section Overview **What you will learn:** The distributional hypothesis; why traditional word representations fail; the geometry of semantic space; and the three main families of word embeddings ::: ## The distributional hypothesis {-} The theoretical foundation of word embeddings is a deceptively simple idea known as the **distributional hypothesis**, articulated most memorably by the British linguist J. R. Firth: > "You shall know a word by the company it keeps" [@firth1957synopsis, p. 11] The core claim is that words appearing in similar contexts have similar meanings. 
Consider three sentences: "The **cat** sat on the mat," "The **dog** sat on the mat," and "The **car** drove down the street." Because *cat* and *dog* share context words (sat, mat) while *car* does not, the distributional hypothesis predicts that *cat* and *dog* should be semantically closer to each other than either is to *car*. Word embeddings operationalise this prediction by assigning each word a position in a multidimensional space such that distributional similarity translates into geometric proximity. This is a linguistically rich idea with deep roots. The distributional hypothesis connects to structuralist notions of paradigmatic and syntagmatic relations, to corpus-based approaches to lexical meaning, and to usage-based theories that treat meaning as emerging from patterns of use in context [@jurafsky2024speech, ch. 6]. The fact that meaning can be approximated from co-occurrence statistics alone — without any hand-crafted semantic knowledge — has proven to be one of the most powerful ideas in computational linguistics. ## The problem with traditional representations {-} Before embeddings became widespread, words were typically represented as **one-hot vectors**: binary vectors with a 1 in the position corresponding to that word and 0 everywhere else. ``` cat = [1, 0, 0, 0, 0, ..., 0] (10,000 dimensions) dog = [0, 1, 0, 0, 0, ..., 0] car = [0, 0, 1, 0, 0, ..., 0] ``` This representation has several fatal problems for semantic tasks. First, there is no way to measure similarity: *cat* and *dog* are as different from each other (cosine similarity = 0) as *cat* and *car*. Second, the representations are extremely sparse — almost entirely zeros — which is computationally wasteful. Third, the dimensionality grows with vocabulary size, reaching tens or hundreds of thousands of dimensions for any realistic corpus. 
**Word embeddings** solve all of these problems: ``` cat = [0.2, -0.4, 0.7, ..., 0.1] (100–300 dimensions) dog = [0.3, -0.5, 0.8, ..., 0.2] (similar to cat) car = [-0.1, 0.6, -0.3, ..., 0.4] (different from cat/dog) ``` The representations are dense (most values non-zero), low-dimensional (typically 50–300), and semantically structured: similar words cluster together and the geometric relationships between vectors encode meaningful semantic relationships. ## Vector space geometry {-} In an embedding space, **similarity is measured by cosine similarity** — the cosine of the angle between two vectors: $$\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}$$ Cosine similarity ranges from –1 (opposite directions) to +1 (identical direction). For word embeddings, values between 0.3 and 0.9 are typical for semantically related words. One of the most striking properties of well-trained embeddings is that **vector arithmetic preserves semantic relationships**. The famous example [@mikolov2013efficient]: $$\text{embedding}(\textit{king}) - \text{embedding}(\textit{man}) + \text{embedding}(\textit{woman}) \approx \text{embedding}(\textit{queen})$$ Geometrically, this says that the direction from *man* to *king* (representing something like "royalty" or "leadership relative to gender") is roughly the same as the direction from *woman* to *queen*. This parallelism reflects the genuine regularity of the relationship in the training data. ## Types of word embeddings {-} Three broad approaches to learning word embeddings have been developed [@almeida2019word]: **Count-based methods (classical)** — build a co-occurrence matrix counting how often each pair of words appears within a context window, then apply dimensionality reduction (typically Singular Value Decomposition) to obtain dense vectors. Examples include Latent Semantic Analysis (LSA) and Hyperspace Analogue to Language (HAL). 
These methods are transparent and interpretable, but less effective than prediction-based methods for most tasks.

**Prediction-based methods (modern)** — train a neural network to predict context words from a target word (skip-gram) or a target word from its context words (CBOW). The embedding vectors are the learned weights of this network. Examples include word2vec [@mikolov2013efficient] and fastText [@bojanowski2017enriching]. GloVe [@pennington2014glove] is usually grouped here as well, although it is fitted to global co-occurrence counts and so combines elements of both families. These are the most widely used methods in linguistic research.

**Contextualised embeddings (current frontier)** — produce different vectors for the same word depending on its context, capturing polysemy and disambiguation. Examples include ELMo, BERT, and GPT. The same word *bank* gets a different vector in "river bank" than in "savings bank." These models are state-of-the-art for most NLP tasks but require substantially more computational resources.

This tutorial focuses on word2vec and GloVe, which remain the methods of choice for most linguistic research applications.

## The word2vec algorithm {-}

Word2vec [@mikolov2013efficient] introduced two architectures for learning embeddings:

**Skip-gram** — given a target word, predict the surrounding context words. The network learns to associate each target word with the types of contexts it appears in.

**CBOW (Continuous Bag of Words)** — given the surrounding context words, predict the target word. CBOW is generally faster to train and works better for frequent words; skip-gram works better for rare words and smaller datasets.

The **skip-gram training process** works as follows. For the sentence "The quick brown fox jumps," with target word *brown* and a window of 2:

Training pairs generated: (*brown*, *the*), (*brown*, *quick*), (*brown*, *fox*), (*brown*, *jumps*)

The neural network learns to maximise the probability of predicting these context words from the target word, adjusting the embedding vectors at each step to reduce prediction error.
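The pair-generation step can be sketched in a few lines of base R. This is an illustration only: `skipgram_pairs()` is a hypothetical helper, not part of the `word2vec` package, and it uses a fixed symmetric window without the sampling refinements that real implementations add.

```{r skipgram-pairs-sketch}
# Generate (target, context) pairs for every position in a token vector
skipgram_pairs <- function(tokens, window = 2) {
  pairs <- lapply(seq_along(tokens), function(i) {
    # Positions within `window` of i, clipped to the sentence, excluding i itself
    ctx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
    if (length(ctx) == 0) return(NULL)  # one-word sentence: no context
    data.frame(target = tokens[i], context = tokens[ctx])
  })
  do.call(rbind, pairs)
}

toy_tokens <- c("the", "quick", "brown", "fox", "jumps")
toy_pairs  <- skipgram_pairs(toy_tokens, window = 2)

# Pairs generated for the target word "brown"
toy_pairs[toy_pairs$target == "brown", ]
```

For *brown*, the context column contains *the*, *quick*, *fox*, and *jumps*, matching the pairs listed above. Words near the sentence edge simply receive fewer pairs.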
After training on many such pairs across the corpus, words that frequently share contexts end up with similar vectors.

Key hyperparameters are summarised below:

| Parameter | What it controls | Typical values |
|---|---|---|
| `dim` | Embedding dimensions | 50–300 |
| `window` | Context window size | 5–10 |
| `min_count` | Minimum word frequency | 5–10 |
| `type` | Architecture | `"skip-gram"` or `"cbow"` |
| `iter` | Training iterations | 10–50 |

::: {.callout-warning}
## Hyperparameter trade-offs

More dimensions capture more nuance but risk overfitting and slow training. A larger window captures broader topical relationships; a smaller window captures tighter syntactic relationships. More iterations improve learning but with diminishing returns. A good starting point is `dim = 100`, `window = 5`, `iter = 20`, then experiment based on results.
:::

::: {.callout-tip}
## Exercises: Theory
:::

**Q1. The word2vec model is trained on a corpus where the word "awful" appears frequently in contexts alongside words like "terrible," "dreadful," and "horrible." The word "awesome" appears frequently alongside "amazing," "incredible," and "brilliant." Based on the distributional hypothesis, what would you predict about the cosine similarity between the embeddings of "awful" and "awesome"? What does this tell us about the relationship between distributional similarity and semantic similarity?**

```{r}
#| echo: false
#| label: "theory_q1"
check_question(
  "The cosine similarity between 'awful' and 'awesome' will likely be low-to-moderate, because their context words are very different — 'awful' co-occurs with negative evaluative adjectives, 'awesome' with positive ones. This illustrates an important limitation of the distributional hypothesis: it captures distributional similarity (similar contexts), which often tracks semantic similarity (synonyms, related concepts), but can miss antonyms.
Antonyms like 'awful' and 'awesome' often appear in quite different contexts despite being from the same semantic domain (evaluation). Count-based embeddings tend to place antonyms further apart, though prediction-based models vary.", options = c( "The cosine similarity will be very high (close to 1.0), because both words are adjectives that evaluate things", "The cosine similarity between 'awful' and 'awesome' will likely be low-to-moderate, because their context words are very different — 'awful' co-occurs with negative evaluative adjectives, 'awesome' with positive ones. This illustrates an important limitation of the distributional hypothesis: it captures distributional similarity (similar contexts), which often tracks semantic similarity (synonyms, related concepts), but can miss antonyms. Antonyms like 'awful' and 'awesome' often appear in quite different contexts despite being from the same semantic domain (evaluation). Count-based embeddings tend to place antonyms further apart, though prediction-based models vary.", "The cosine similarity will be exactly 0, because the two words have opposite meanings", "It is impossible to predict the similarity without running the model" ), type = "radio", q_id = "theory_q1", random_answer_order = FALSE, button_label = "Check answer", right = "Correct! The distributional hypothesis predicts similarity based on context patterns, not semantic category membership. 'Awful' and 'awesome' appear in systematically different contexts — one with negative evaluative vocabulary, one with positive — so their embeddings will likely be dissimilar. This is actually a well-known limitation: embeddings trained purely on co-occurrence statistics often fail to reliably distinguish synonyms from antonyms, because antonyms sometimes appear in similar syntactic frames (e.g., 'the movie was awful/awesome'). Supervised methods or additional training objectives are needed to capture this distinction reliably.", wrong = "Not quite. 
The distributional hypothesis is about context patterns, not part-of-speech or semantic domain. 'Awful' and 'awesome' appear in systematically different contexts — negative versus positive evaluative vocabulary — so their embeddings will be dissimilar. The hypothesis does not guarantee that antonyms will be similar just because they belong to the same broad category. In fact, a known limitation of distributional embeddings is that they can struggle to reliably separate antonyms from synonyms."
)
```

---

# Part 2: Setup and Data {#part2}

::: {.callout-note}
## Section Overview

**What you will learn:** Which R packages are needed; how to install them; and how to load and preprocess text for embedding training
:::

## Installing packages {-}

```{r install, eval=FALSE}
install.packages("word2vec")    # word2vec training
install.packages("text2vec")    # alternative, faster for large corpora
install.packages("textdata")    # download pre-trained GloVe
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("purrr")
install.packages("ggplot2")
install.packages("ggrepel")
install.packages("Rtsne")       # t-SNE dimensionality reduction
install.packages("here")
install.packages("flextable")
install.packages("checkdown")
```

## Loading packages {-}

```{r load-packages, message=FALSE, warning=FALSE}
library(word2vec)
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
library(ggplot2)
library(ggrepel)
library(Rtsne)
library(here)
library(flextable)
```

::: {.callout-note}
## Package overview

- `word2vec` — trains word2vec models; the easiest entry point for beginners
- `text2vec` — faster and more memory-efficient for large corpora
- `textdata` — provides convenient access to pre-trained GloVe embeddings
- `Rtsne` — implements t-SNE for 2D visualisation of high-dimensional embedding spaces
:::

## Loading example data {-}

We use three public-domain literary texts — *Alice's Adventures in Wonderland*, *Moby Dick*, and *Pride and Prejudice* —
as our training corpus. These texts provide sufficient vocabulary diversity for demonstration purposes, though for serious research applications you would want a domain-specific corpus of at least 10 million words.

```{r load-data, message=FALSE, warning=FALSE}
alice <- readLines(here::here("tutorials/embeddings/data", "alice.txt"))
moby  <- readLines(here::here("tutorials/embeddings/data", "moby.txt"))
pride <- readLines(here::here("tutorials/embeddings/data", "pride.txt"))

# Combine into a single corpus string
corpus <- paste(c(alice, moby, pride), collapse = " ")

# Preprocessing: lowercase and normalise whitespace
corpus_clean <- corpus |>
  tolower() |>
  str_replace_all("\\s+", " ") |>
  str_trim()

cat("Corpus size:", str_count(corpus_clean, "\\S+"), "words\n")
cat("First 200 characters:\n")
cat(substr(corpus_clean, 1, 200), "...\n")
```

::: {.callout-note}
## Preprocessing decisions

For this tutorial we apply minimal preprocessing: lowercasing and whitespace normalisation. Your own research may require different choices:

- **Keep case** if named entities or proper nouns are important to your research questions
- **Keep punctuation** if syntactic context is relevant (some embeddings benefit from sentence boundary information)
- **Remove or tag numbers** depending on whether numeric values carry semantic content in your domain
- **Handle contractions** consistently (e.g., expand *don't* to *do not*, or keep as is)

There is no universal answer — the right preprocessing depends on your research question and corpus.
:::

---

# Part 3: Training a Word2Vec Model {#part3}

::: {.callout-note}
## Section Overview

**What you will learn:** How to format text for word2vec; how to train a model; and how to extract and inspect the embedding matrix
:::

## Formatting text for word2vec {-}

::: {.callout-important}
## word2vec requires a vector of sentences, not a single string

The `word2vec()` function processes text sentence by sentence.
It must receive a character vector where each element is a sentence (or document). A single long string will cause a training error.
:::

```{r prepare-sentences, message=FALSE, warning=FALSE}
# Split corpus into sentences
corpus_sentences <- corpus_clean |>
  str_split("\\.\\s+") |>
  unlist() |>
  discard(~ nchar(.x) == 0)

cat("Number of sentences:", length(corpus_sentences), "\n")
cat("Example sentence:", corpus_sentences[[5]], "\n")
```

## Training the model {-}

```{r train-model, message=FALSE, warning=FALSE}
set.seed(42)

model <- word2vec(
  x         = corpus_sentences,
  type      = "skip-gram",
  dim       = 100,
  window    = 5,
  iter      = 20,
  min_count = 5,
  threads   = 2
)

# Inspect: show first 50 words in vocabulary
summary(model)[1:50]
```

**What happened during training:**

1. The text was tokenised into words
2. Random initial vectors were assigned to all vocabulary words
3. For each target word, the model predicted context words within a window of 5
4. Prediction errors were used to update the embedding vectors
5. This process was repeated 20 times (iterations) across the entire corpus
6. Words that frequently shared contexts ended up with similar vectors

## Extracting the embedding matrix {-}

```{r extract-embeddings, message=FALSE, warning=FALSE}
embedding_matrix <- as.matrix(model)

cat("Embedding matrix:", nrow(embedding_matrix), "words x",
    ncol(embedding_matrix), "dimensions\n")

# Look at a specific word's embedding
word_example <- "alice"
if (word_example %in% rownames(embedding_matrix)) {
  cat("\nFirst 10 dimensions of embedding for '", word_example, "':\n", sep = "")
  cat(round(embedding_matrix[word_example, 1:10], 4), "...\n")
}
```

Each row of the embedding matrix represents one word as a 100-dimensional vector. The individual dimension values have no direct linguistic interpretation — what matters is the *pattern* of values across dimensions, and specifically how similar two patterns are to each other.
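How similar two such patterns are can be computed by hand with the cosine formula from Part 1. The sketch below is self-contained and uses a made-up three-dimensional matrix (`toy_embeddings`, with values borrowed from the illustrative vectors shown earlier) rather than the trained model, so the numbers carry no linguistic meaning.

```{r cosine-sketch}
# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy embedding matrix: rows are words, columns are dimensions (invented values)
toy_embeddings <- rbind(
  cat = c( 0.2, -0.4,  0.7),
  dog = c( 0.3, -0.5,  0.8),
  car = c(-0.1,  0.6, -0.3)
)

cosine_sim(toy_embeddings["cat", ], toy_embeddings["dog", ])  # close to 1
cosine_sim(toy_embeddings["cat", ], toy_embeddings["car", ])  # negative

# All pairwise similarities at once: scale each row to unit length,
# then take dot products between every pair of rows
unit_rows  <- toy_embeddings / sqrt(rowSums(toy_embeddings^2))
sim_matrix <- unit_rows %*% t(unit_rows)
round(sim_matrix, 3)
```

The same two lines that build `sim_matrix` should work on a real embedding matrix such as `embedding_matrix`, although for a large vocabulary the full word-by-word matrix can become too big to hold in memory.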
---

# Part 4: Semantic Similarity {#part4}

::: {.callout-note}
## Section Overview

**What you will learn:** How to find the nearest neighbours of a word in embedding space; how to interpret cosine similarity scores; and how similarity in embedding space relates to (and differs from) lexical semantic similarity
:::

## Finding similar words {-}

```{r similar-words, message=FALSE, warning=FALSE}
# Find the 10 words most similar to "queen"
similar_to_queen <- predict(
  model,
  newdata = "queen",
  type    = "nearest",
  top_n   = 10
)

similar_to_queen |>
  as.data.frame() |>
  flextable() |>
  set_table_properties(width = 0.6, layout = "autofit") |>
  theme_zebra() |>
  set_caption("Top 10 words most similar to 'queen'") |>
  border_outer()
```

## Visualising similarity scores {-}

```{r similarity-plot, message=FALSE, warning=FALSE}
similar_ship <- predict(model, newdata = "ship", type = "nearest", top_n = 12)

similar_ship |>
  as.data.frame() |>
  head(10) |>
  ggplot(aes(x = reorder(ship.term2, ship.similarity), y = ship.similarity)) +
  geom_bar(stat = "identity", fill = "gray50") +
  coord_flip() +
  labs(
    title = "Words most similar to 'ship'",
    x     = "Word",
    y     = "Cosine similarity"
  ) +
  theme_bw()
```

## Exploring multiple words {-}

```{r explore-words, message=FALSE, warning=FALSE}
test_words <- c("love", "king", "ocean", "thought")

for (word in test_words) {
  if (word %in% rownames(embedding_matrix)) {
    similar <- predict(model, newdata = word, type = "nearest", top_n = 5)
    cat("\nMost similar to '", word, "':\n", sep = "")
    print(as.data.frame(similar)[, 2])
  }
}
```

::: {.callout-note}
## Interpreting similarity results

Embedding similarity captures several different types of relationships simultaneously — synonymy, antonymy, topical association, and grammatical similarity can all contribute. Words that are similar in the embedding space share distributional contexts; this often correlates with semantic similarity but is not identical to it.
Notably, antonyms like *love* and *hate* may appear relatively similar in embedding space because they occupy parallel syntactic positions in sentences about emotion. This is a known limitation of static word embeddings trained purely on co-occurrence: they encode distributional pattern similarity, which only approximates semantic similarity.
:::

::: {.callout-tip}
## Exercises: Semantic similarity
:::

**Q2. A researcher trains a word2vec model on a corpus of 19th-century English novels and finds that the words most similar to "gay" are: cheerful, merry, lively, bright, animated. A colleague trains a model on a corpus of 21st-century web text and finds the most similar words to be: lesbian, bisexual, queer, transgender, pride. What does this difference illustrate, and why is it important for linguistic research using word embeddings?**

```{r}
#| echo: false
#| label: "similarity_q2"
check_question(
  "This illustrates semantic change — the word 'gay' has shifted its primary meaning from 'cheerful, happy, lively' (its dominant sense in 19th-century usage) to referring to homosexuality (its dominant sense in contemporary usage). Because word embeddings capture distributional patterns from their training corpus, they effectively create a snapshot of word meaning as it exists in that corpus at that time. This makes embeddings powerful tools for studying diachronic semantic change: by training separate models on corpora from different time periods and comparing the nearest neighbours of the same word, researchers can quantify and document meaning shifts.
This approach has been applied systematically in diachronic linguistics research.", options = c( "It illustrates that word2vec models are unreliable because they give different results for the same word", "It illustrates that 19th-century corpora are too small to train reliable embeddings", "This illustrates semantic change — the word 'gay' has shifted its primary meaning from 'cheerful, happy, lively' (its dominant sense in 19th-century usage) to referring to homosexuality (its dominant sense in contemporary usage). Because word embeddings capture distributional patterns from their training corpus, they effectively create a snapshot of word meaning as it exists in that corpus at that time. This makes embeddings powerful tools for studying diachronic semantic change: by training separate models on corpora from different time periods and comparing the nearest neighbours of the same word, researchers can quantify and document meaning shifts. This approach has been applied systematically in diachronic linguistics research.", "It illustrates that the two models used different hyperparameters and are not comparable" ), type = "radio", q_id = "similarity_q2", random_answer_order = FALSE, button_label = "Check answer", right = "Correct! This is exactly the kind of evidence for semantic change that diachronic embedding methods were designed to detect and quantify. Researchers such as Hamilton, Leskovec & Jurafsky have developed rigorous methods for comparing embeddings across time periods and documenting the statistical laws that govern semantic change. The key insight is that embeddings are not objective descriptions of word meaning — they are snapshots of distributional patterns in a particular corpus. Treating them as such makes them powerful instruments for historical semantics.", wrong = "Not quite. The differences between the two models' outputs are not methodological artefacts — they are meaningful signals about how language use has changed. 
Word embeddings capture distributional patterns from their training data. When a word's distributional context changes systematically over time (because its meaning has shifted), embeddings trained on corpora from different periods will reflect that change. This is the basis of the entire field of diachronic word embedding research."
)
```

---

# Part 5: Word Analogies {#part5}

::: {.callout-note}
## Section Overview

**What you will learn:** How vector arithmetic implements word analogies; how to compute analogies manually; the conditions under which analogies work well; and how to visualise the geometric parallelism that makes analogies possible
:::

## Vector arithmetic for analogies {-}

One of the most celebrated properties of word embeddings is that vector arithmetic can solve analogy tasks [@mikolov2013efficient]. The analogy "man is to king as woman is to ?" is solved by:

$$\text{target} = \text{embedding}(\textit{king}) - \text{embedding}(\textit{man}) + \text{embedding}(\textit{woman})$$

The word whose embedding is closest to this target vector should be *queen*. Geometrically, the vector from *man* to *king* captures the "royalty" direction in embedding space, and applying the same displacement to *woman* points toward *queen*.

```{r analogy-function, message=FALSE, warning=FALSE}
# Helper function: a is to b as c is to ?
# Mathematically: result = b - a + c
word_analogy <- function(model, a, b, c, top_n = 5) {
  embeddings <- as.matrix(model)

  if (!all(c(a, b, c) %in% rownames(embeddings))) {
    missing <- c(a, b, c)[!c(a, b, c) %in% rownames(embeddings)]
    stop(paste("Words not in vocabulary:", paste(missing, collapse = ", ")))
  }

  vec_a <- embeddings[a, ]
  vec_b <- embeddings[b, ]
  vec_c <- embeddings[c, ]

  # Compute the target vector
  target_vector <- vec_b - vec_a + vec_c

  # Cosine similarity with all vocabulary words
  similarities <- apply(embeddings, 1, function(wv) {
    sum(wv * target_vector) / (sqrt(sum(wv^2)) * sqrt(sum(target_vector^2)))
  })

  # Remove the input words from results
  similarities <- similarities[!names(similarities) %in% c(a, b, c)]

  top_words <- sort(similarities, decreasing = TRUE)[1:top_n]

  data.frame(
    word       = names(top_words),
    similarity = as.numeric(top_words),
    row.names  = NULL
  )
}
```

```{r run-analogy, message=FALSE, warning=FALSE}
# Classic analogy: man is to king as woman is to ?
analogy_result <- word_analogy(model, a = "man", b = "king", c = "woman", top_n = 5)

analogy_result |>
  flextable() |>
  set_table_properties(width = 0.5, layout = "autofit") |>
  theme_zebra() |>
  set_caption("king - man + woman = ?") |>
  border_outer()
```

## More analogy examples {-}

```{r more-analogies, message=FALSE, warning=FALSE}
vocab <- rownames(as.matrix(model))

test_analogy <- function(model, a, b, c) {
  vocab <- rownames(as.matrix(model))
  label <- paste(a, ":", b, "::", c, ": ?")

  if (!all(c(a, b, c) %in% vocab)) {
    missing <- c(a, b, c)[!c(a, b, c) %in% vocab]
    cat(label, "\n  Words not in vocabulary:", paste(missing, collapse = ", "), "\n\n")
    return(invisible(NULL))
  }

  result <- word_analogy(model, a, b, c, top_n = 5)
  cat(label, "\n")
  cat("  Top results:", paste(result$word[1:5], collapse = ", "), "\n")
  cat("  Similarities:", paste(round(result$similarity[1:5], 3), collapse = ", "), "\n\n")
  return(invisible(result))
}

test_analogy(model, "queen", "woman", "man")
test_analogy(model, "alice",
             "girl", "boy")

# walking : walk :: running : ?  (the expected answer is "run")
if (all(c("walking", "walk", "running") %in% vocab)) {
  test_analogy(model, "walking", "walk", "running")
}
```

::: {.callout-warning}
## Why analogies sometimes fail

Analogy results depend heavily on the size and diversity of the training corpus. Our literary corpus of approximately 500,000 words is far smaller than what is needed for robust analogy performance — serious analogy benchmarking requires 100 million words or more [@mikolov2013efficient]. Analogies work best when the relationship is represented consistently across many examples in the training data. The gender analogy (man/king/woman/queen) works well on large corpora because gendered word pairs (*he*/*she*, *man*/*woman*, *king*/*queen*) occur pervasively and consistently; idiomatic or culture-specific relationships work much less reliably.
:::

## Visualising vector arithmetic {-}

```{r analogy-viz, message=FALSE, warning=FALSE, fig.width=8, fig.height=6}
vocab <- rownames(as.matrix(model))

if (all(c("man", "woman", "king", "queen") %in% vocab)) {
  embeddings        <- as.matrix(model)
  words_of_interest <- c("man", "woman", "king", "queen")
  word_embeddings   <- embeddings[words_of_interest, ]

  pca_result <- prcomp(word_embeddings, center = TRUE, scale.
                       = FALSE)

  viz_data <- data.frame(
    word = words_of_interest,
    x    = pca_result$x[, 1],
    y    = pca_result$x[, 2]
  )

  ggplot(viz_data, aes(x = x, y = y, label = word)) +
    geom_point(size = 4, color = "steelblue") +
    geom_text_repel(size = 5, fontface = "bold") +
    geom_segment(
      data = viz_data[viz_data$word %in% c("man", "king"), ],
      aes(x = x[1], y = y[1], xend = x[2], yend = y[2]),
      arrow = arrow(length = unit(0.3, "cm")),
      color = "firebrick", linewidth = 1
    ) +
    geom_segment(
      data = viz_data[viz_data$word %in% c("woman", "queen"), ],
      aes(x = x[1], y = y[1], xend = x[2], yend = y[2]),
      arrow = arrow(length = unit(0.3, "cm")),
      color = "steelblue", linewidth = 1
    ) +
    theme_minimal() +
    labs(
      title    = "Vector arithmetic: parallel relationships",
      subtitle = "Red arrow (man to king) should parallel blue arrow (woman to queen)",
      x        = "First principal component",
      y        = "Second principal component"
    ) +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      axis.text  = element_blank(),
      panel.grid = element_blank()
    )
} else {
  cat("Words man, woman, king, queen not all in vocabulary — skipping visualisation.\n")
}
```

The two arrows should be roughly parallel and equal in length. This geometric parallelism is what makes vector-arithmetic analogies work: the transformation that turns *man* into *king* (in the embedding space) is approximately the same as the transformation that turns *woman* into *queen*.

::: {.callout-tip}
## Exercises: Word analogies
:::

**Q3. A researcher applies the analogy `walking : walk :: running : ?` to a word2vec model trained on a small corpus of 500,000 words and gets poor results (the top result is an unrelated word). She concludes that word2vec cannot capture morphological relationships. Is this conclusion justified? What are the actual reasons the analogy might fail, and what should she do instead?**

```{r}
#| echo: false
#| label: "analogy_q3"
check_question(
  "The conclusion is not justified.
The failure of a specific analogy task on a small corpus does not mean word2vec cannot in principle capture morphological relationships — it means the conditions for reliable analogy performance were not met. Word2vec analogies require (1) a large corpus (100M+ words is ideal; 500K is far too small), (2) all three words to appear frequently enough to have well-estimated embeddings, and (3) the relationship to be consistently represented across many examples in the training data. On small corpora, rare word pairs have noisy, poorly estimated vectors, making analogy arithmetic unreliable. The researcher should either train on a much larger corpus, use pre-trained embeddings (GloVe or fastText), or use a morphologically aware model like fastText that handles subword information explicitly.", options = c( "The conclusion is justified — word2vec is fundamentally unable to capture morphological relationships", "The conclusion is not justified. The failure of a specific analogy task on a small corpus does not mean word2vec cannot in principle capture morphological relationships — it means the conditions for reliable analogy performance were not met. Word2vec analogies require (1) a large corpus (100M+ words is ideal; 500K is far too small), (2) all three words to appear frequently enough to have well-estimated embeddings, and (3) the relationship to be consistently represented across many examples in the training data. On small corpora, rare word pairs have noisy, poorly estimated vectors, making analogy arithmetic unreliable. 
The researcher should either train on a much larger corpus, use pre-trained embeddings (GloVe or fastText), or use a morphologically aware model like fastText that handles subword information explicitly.",
    "The conclusion is justified because word2vec uses whole-word representations and cannot see morphological structure",
    "The conclusion is not justified, but the solution is simply to run more training iterations on the same corpus"
  ),
  type = "radio",
  q_id = "analogy_q3",
  random_answer_order = FALSE,
  button_label = "Check answer",
  right = "Correct! Corpus size is the single most important factor for analogy performance. With 500,000 words, most word pairs appear too infrequently for their embeddings to be well-estimated, making vector arithmetic noisy and unreliable. For comparison, the widely used pre-trained Google News word2vec vectors were trained on roughly 100 billion words. On smaller corpora, using pre-trained embeddings is almost always preferable. For morphological relationships specifically, fastText is the better choice because it represents words as bags of character n-grams and can compute reasonable embeddings for words it has never seen in full form.",
  wrong = "Not quite. The issue is not a fundamental limitation of word2vec — it is a data size problem. Analogy tasks require words to be frequent enough that their embedding vectors are stable and well-estimated. With 500,000 words, even moderately common words have very few training examples, producing noisy vectors that make vector arithmetic unreliable. The pre-trained Google News word2vec vectors, by comparison, were trained on roughly 100 billion words. More training iterations on the same small corpus will not fix this — you need more data, not more passes over the same data."
)
```

---

# Part 6: Visualising Embeddings {#part6}

::: {.callout-note}
## Section Overview

**What you will learn:** Why embeddings need dimensionality reduction for visualisation; how t-SNE works conceptually; how to run t-SNE in R; and how to interpret the resulting plot
:::

## The dimensionality challenge {-}

Embedding spaces have 50–300 dimensions. Human perception is limited to 2–3 dimensions. To visualise where words sit relative to each other, we need to project the high-dimensional space down to 2D while preserving the local neighbourhood structure as faithfully as possible.

**t-SNE** (t-Distributed Stochastic Neighbor Embedding) is the most popular method for this purpose. It works by modelling the probability that two points are neighbours in the high-dimensional space, then finding a 2D configuration that matches those probabilities as closely as possible. Words that are close in 100D should be close in 2D; words that are far apart usually end up apart as well, although t-SNE does not preserve large distances reliably.

## Preparing the data {-}

```{r tsne-prep, message=FALSE, warning=FALSE}
words_to_plot <- c(
  # Characters
  "alice", "queen", "king", "hatter", "rabbit",
  # Emotions
  "happy", "sad", "angry", "joy", "fear",
  # Actions
  "walk", "run", "jump", "sit", "stand",
  # Places
  "house", "garden", "forest", "city", "ocean",
  # Abstract
  "love", "hate", "hope", "dream", "thought"
)

# Keep only words in our vocabulary
words_to_plot   <- words_to_plot[words_to_plot %in% rownames(embedding_matrix)]
plot_embeddings <- embedding_matrix[words_to_plot, ]

cat("Words available for t-SNE plot:", length(words_to_plot), "\n")
```

## Running t-SNE {-}

```{r tsne-run, message=FALSE, warning=FALSE}
set.seed(42)

tsne_result <- Rtsne(
  plot_embeddings,
  dims       = 2,
  # Rtsne requires 3 * perplexity <= n - 1; floor() keeps the bound safe
  perplexity = min(10, floor((nrow(plot_embeddings) - 1) / 3)),
  theta      = 0.0,
  max_iter   = 1000
)

tsne_data <- data.frame(
  word     = words_to_plot,
  x        = tsne_result$Y[, 1],
  y        = tsne_result$Y[, 2],
  category = case_when(
    words_to_plot %in% c("alice", "queen", "king", "hatter", "rabbit")
      ~ "Characters",
    words_to_plot %in% c("happy", "sad", "angry", "joy", "fear") ~ "Emotions",
    words_to_plot %in% c("walk", "run", "jump", "sit", "stand") ~ "Actions",
    words_to_plot %in% c("house", "garden", "forest", "city", "ocean") ~ "Places",
    TRUE ~ "Abstract"
  )
)
```

## Plotting the embedding space {-}

```{r tsne-plot, message=FALSE, warning=FALSE, fig.width=10, fig.height=8}
ggplot(tsne_data, aes(x = x, y = y, color = category, label = word)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_text_repel(size = 4, max.overlaps = 20, box.padding = 0.5) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title      = element_text(size = 16, face = "bold"),
    axis.text       = element_blank(),
    axis.ticks      = element_blank(),
    panel.grid      = element_blank()
  ) +
  labs(
    title    = "Word embeddings visualised with t-SNE",
    subtitle = "Semantically similar words cluster together",
    x        = NULL,
    y        = NULL,
    color    = "Category"
  )
```

**Reading the plot:**

- **Proximity indicates semantic similarity** — words that are close together in the plot have similar embedding vectors, meaning they tend to appear in similar contexts
- **Clusters reflect shared distributional patterns** — words from the same semantic category often cluster together
- **Absolute position is arbitrary** — the specific coordinates have no meaning; only relative distances matter
- **t-SNE preserves local structure** — words that are neighbours in the high-dimensional space will be neighbours in 2D, but distances between distant clusters are not reliably preserved

::: {.callout-note}
## t-SNE parameter notes

The **perplexity** parameter (roughly: how many neighbours to consider) has a large effect on the visual appearance. Too low a value overemphasises local structure; too high a value loses it. A value of 5–50 is typically appropriate, with 30 being the most common default.
The **theta** parameter controls the speed-accuracy trade-off: 0.0 gives exact t-SNE (slow but accurate); 0.5 gives the Barnes-Hut approximation (much faster, good enough for large datasets).

Importantly, t-SNE is stochastic — running it twice with different seeds gives different plots. Always set a seed for reproducibility.
:::

---

# Part 7: Pre-Trained Embeddings {#part7}

::: {.callout-note}
## Section Overview

**What you will learn:** When to use pre-trained embeddings rather than training your own; how to download and use GloVe; and how to choose between GloVe, fastText, and BERT for your research question
:::

## When to use pre-trained embeddings {-}

Pre-trained embeddings offer several advantages over training your own. They are trained on massive datasets — billions of words — giving much better coverage of rare words and more reliable semantic representations. They require no training time. And they have been validated on standard benchmarks and used in published research, making results easier to contextualise and compare.

You should train your own embeddings when:

- Your corpus is in a specialised domain with vocabulary not well represented in general corpora (e.g. historical texts, clinical notes, legal documents, a specific language or dialect)
- Your research question concerns the distributional patterns in a particular corpus (e.g. comparing across time periods or registers)
- Pre-trained embeddings for your language are not available

## GloVe embeddings {-}

GloVe (Global Vectors for Word Representation) [@pennington2014glove] is trained on aggregated global word-word co-occurrence statistics from a large corpus. Unlike word2vec, which uses a local context window, GloVe explicitly factorises the word-word co-occurrence matrix, combining the advantages of global matrix factorisation and local window-based methods. GloVe embeddings trained on Wikipedia and Gigaword (6 billion tokens, 400K vocabulary) are widely used as a general-purpose baseline.
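To make "word-word co-occurrence statistics" concrete, the sketch below counts co-occurrences within a symmetric window for the three toy sentences from Part 1. It only illustrates the input that a GloVe-style model starts from: the object names are invented for this example, every pair is counted with weight 1 (GloVe itself down-weights more distant pairs), and no vectors are fitted.

```{r cooccurrence-sketch}
# Three tokenised toy sentences (taken from the cat/dog/car example in Part 1)
toy_sentences <- list(
  c("the", "cat", "sat", "on", "the", "mat"),
  c("the", "dog", "sat", "on", "the", "mat"),
  c("the", "car", "drove", "down", "the", "street")
)
toy_window <- 2

# Square matrix of zeros with one row and one column per word type
toy_terms <- sort(unique(unlist(toy_sentences)))
cooc <- matrix(0, nrow = length(toy_terms), ncol = length(toy_terms),
               dimnames = list(toy_terms, toy_terms))

# For every token, add 1 for each neighbour within the window
for (tokens in toy_sentences) {
  for (i in seq_along(tokens)) {
    ctx <- setdiff(max(1, i - toy_window):min(length(tokens), i + toy_window), i)
    for (j in ctx) {
      cooc[tokens[i], tokens[j]] <- cooc[tokens[i], tokens[j]] + 1
    }
  }
}

# "cat" and "dog" have identical rows apart from their own columns; "car" differs
cooc[c("cat", "dog", "car"), ]
```

In this toy matrix *cat* and *dog* both co-occur with *sat* and *on*, while *car* co-occurs with *drove* and *down*. That difference is the distributional signal an embedding model compresses into dense vectors.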
```{r glove-download, eval=FALSE}
library(textdata)

# Download 100-dimensional GloVe vectors (one-time, ~800 MB)
glove <- embedding_glove6b(dimensions = 100)
```

```{r glove-use, eval=FALSE}
# Rename columns for clarity
colnames(glove)[1] <- "word"
colnames(glove)[2:ncol(glove)] <- paste0("dim_", 1:100)

# Convert to matrix
glove_matrix <- as.matrix(glove[, -1])
rownames(glove_matrix) <- glove$word

# Find similar words (cosine similarity of every row with the target vector)
target_word <- "linguistics"
target_vector <- glove_matrix[target_word, ]
similarities <- apply(glove_matrix, 1, function(x) {
  sum(x * target_vector) / (sqrt(sum(x^2)) * sqrt(sum(target_vector^2)))
})

# The first hit is the target word itself (similarity 1)
head(sort(similarities, decreasing = TRUE), 10)
```

For comprehensive background on neural network approaches to NLP, including the theory behind word embeddings, see @goldberg2017neural. For an accessible introduction to transformer models and contextualised embeddings, see @tunstall2022nlp.

## Choosing between pre-trained models {-}

| Model | Trained on | Vocabulary | Dimensions | Best for |
|---|---|---|---|---|
| **GloVe** | Wikipedia + Gigaword (6B tokens) | 400K | 50–300 | General English, linguistics research |
| **fastText** | Common Crawl (600B tokens) | 2M+ | 300 | Morphologically rich languages, rare words, misspellings |
| **word2vec (Google News)** | Google News (100B tokens) | 3M | 300 | News domain |
| **BERT** | Wikipedia + BookCorpus | Contextual | 768 | Context-dependent tasks, polysemy |

::: {.callout-tip}
## Choosing a model

**GloVe** is usually the best starting point for general linguistic research: it is simple to use, well-documented, and widely cited.

**fastText** [@bojanowski2017enriching] handles out-of-vocabulary words by representing them as the sum of their character n-gram vectors. This makes it superior for morphologically complex languages (German, Finnish, Turkish), historical texts with spelling variation, and social media text with neologisms and misspellings.

**BERT and other transformer models** produce contextualised embeddings: the same word gets a different vector depending on its context. This is essential for tasks involving polysemy (word sense disambiguation), but adds substantial complexity and computational cost. See the LADAL [Text Classification with BERT](/tutorials/bert/bert.html) tutorial for an introduction.
:::

```{r load-pretrained, eval=FALSE}
# fastText (requires fastrtext package)
library(fastrtext)
model_ft <- load_model("path/to/fasttext/model.bin")

# For BERT and other transformers, use the `text` package
library(text)
embeddings_bert <- textEmbed(
  texts = c("The bank is near the river", "I need to visit the bank"),
  model = "bert-base-uncased"
)
# "bank" gets DIFFERENT vectors in these two sentences
```

---

# Part 8: Research Applications {#part8}

::: {.callout-note}
## Section Overview

**What you will learn:** How to apply embeddings to three substantive linguistic research questions: diachronic semantic change detection, gender bias measurement, and metaphor analysis
:::

## Semantic change detection {-}

Tracking how word meanings shift over time is one of the most compelling applications of word embeddings in linguistics. By training separate models on corpora from different historical periods and comparing the nearest neighbours of a target word, researchers can document, date, and quantify semantic change [@hamilton2016diachronic].

Hamilton, Leskovec & Jurafsky proposed two statistical laws of semantic change: the **law of conformity** (high-frequency words change more slowly) and the **law of innovation** (words with more senses change faster). Both laws were discovered by analysing diachronic embeddings at scale across multiple languages and centuries.

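
One simple way to turn "comparing the nearest neighbours" into a number is to measure how much the neighbour sets of a word overlap between two periods. The sketch below is an illustration only: it is not the measure used by Hamilton et al., the function name `neighbour_overlap()` is hypothetical, and the two neighbour sets are invented toy data rather than output from a trained model.

```r
# Toy neighbour sets for one target word in two periods
# (invented for illustration; not produced by a trained model)
neighbours_early <- c("sow", "scatter", "seed", "spread", "field")
neighbours_late  <- c("radio", "television", "transmit", "spread", "air")

# Jaccard overlap: shared neighbours divided by all distinct neighbours
neighbour_overlap <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

overlap      <- neighbour_overlap(neighbours_early, neighbours_late)
change_score <- 1 - overlap  # 0 = identical neighbourhoods, 1 = no shared neighbours

round(c(overlap = overlap, change = change_score), 3)
```

Because this score only compares sets of words, it needs no alignment of the two vector spaces. The trade-off is that it is sensitive to the chosen number of neighbours and ignores how similar the non-shared neighbours are to one another.
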
```{r semantic-change, eval=FALSE}
# Conceptual example: requires historical corpora
# (load_corpus() stands in for your own corpus-loading code)
corpus_1800s <- load_corpus("1800-1850")
corpus_2000s <- load_corpus("2000-2020")

model_1800s <- word2vec(corpus_1800s, dim = 100, iter = 20)
model_2000s <- word2vec(corpus_2000s, dim = 100, iter = 20)

target_word <- "gay"

# Neighbours in 1800s: cheerful, merry, lively, bright, festive
neighbors_1800s <- predict(model_1800s, target_word, type = "nearest", top_n = 10)

# Neighbours in 2000s: lesbian, queer, bisexual, pride, homosexual
neighbors_2000s <- predict(model_2000s, target_word, type = "nearest", top_n = 10)

# The shift in neighbourhood shows how the meaning changed;
# dating the change requires models for the intermediate periods
```

## Bias detection {-}

Embeddings trained on large text corpora absorb the statistical regularities of human language use, including its biases. Research has shown that standard English word embeddings associate occupational terms with genders in ways that mirror historical stereotypes: *doctor* and *engineer* are closer to the male pole of the gender axis; *nurse* and *secretary* are closer to the female pole [@bolukbasi2016man; @caliskan2017semantics; @garg2018word].

```{r bias-detection, eval=FALSE}
# Define a gender direction vector
man_vec <- embedding_matrix["man", ]
woman_vec <- embedding_matrix["woman", ]
gender_direction <- woman_vec - man_vec

# Measure gender association of occupational terms
occupations <- c("doctor", "nurse", "engineer", "teacher", "programmer", "secretary")

occupation_bias <- sapply(occupations, function(occ) {
  if (occ %in% rownames(embedding_matrix)) {
    occ_vec <- embedding_matrix[occ, ]
    # Cosine similarity between the occupation vector and the gender direction
    sum(occ_vec * gender_direction) /
      (sqrt(sum(occ_vec^2)) * sqrt(sum(gender_direction^2)))
  } else NA
})

# Positive values = more female-associated; negative = more male-associated
sort(occupation_bias)
```

::: {.callout-warning}
## Ethical considerations when using embeddings

Embeddings encode and reproduce the biases present in their training data.
This has practical consequences: systems that use embeddings for hiring, lending, or content recommendation can perpetuate and amplify historical inequalities even without explicit discriminatory intent.

For researchers, the key obligations are:

- Acknowledge the limitations and potential biases of any embedding model used
- Do not treat embedding-based associations as objective truths about the world; they reflect patterns in text, which reflect patterns of human behaviour and historical inequalities
- When using embeddings in applied contexts, consider debiasing techniques [@bolukbasi2016man], though note that these techniques are themselves imperfect and contested
- Use diverse, balanced training corpora where possible
:::

## Metaphor analysis {-}

Word embeddings can quantify cross-domain semantic associations that underlie conceptual metaphors. The conceptual metaphor IDEAS ARE LIGHT predicts that words from the light/illumination domain should be semantically similar to words from the knowledge/understanding domain. We can test this by computing the similarity matrix between source and target domain words [@jurafsky2024speech].

```{r metaphor-analysis, eval=FALSE}
source_domain <- c("light", "bright", "illuminate", "shine", "glow")
target_domain <- c("idea", "thought", "insight", "knowledge", "understanding")

# Cells stay at 0 when either word is missing from the vocabulary
metaphor_matrix <- matrix(
  0,
  nrow = length(source_domain),
  ncol = length(target_domain),
  dimnames = list(source_domain, target_domain)
)

for (s in source_domain) {
  for (t in target_domain) {
    if (s %in% rownames(embedding_matrix) && t %in% rownames(embedding_matrix)) {
      sv <- embedding_matrix[s, ]
      tv <- embedding_matrix[t, ]
      # Cosine similarity between source-domain and target-domain word
      metaphor_matrix[s, t] <- sum(sv * tv) / (sqrt(sum(sv^2)) * sqrt(sum(tv^2)))
    }
  }
}

# Visualise as a heatmap
library(pheatmap)
pheatmap(metaphor_matrix,
         main = "IDEAS ARE LIGHT metaphor: cross-domain similarities",
         display_numbers = TRUE,
         number_format = "%.2f")
```

---

# Part 9: Advanced Topics and Practical Workflow {#part9}

::: {.callout-note}
## Section Overview

**What you will learn:** Tips for getting better embeddings; common training errors and their fixes; how to evaluate embedding quality; an overview of document and sentence embeddings; and a practical decision framework for real research projects
:::

## Getting better embeddings {-}

**Data quality and quantity are the most important factors.** For word2vec, aim for at least 10 million words; 100 million is better. Clean your data carefully:

```r
library(stringr)

corpus_clean <- raw_text |>
  iconv(to = "UTF-8", sub = "") |>                     # Fix encoding
  tolower() |>
  str_replace_all("http\\S+", " ") |>                  # Remove URLs
  str_replace_all("\\d+", " ") |>                      # Replace digits with a space
  str_replace_all("[^[:alnum:][:space:]]", " ") |>     # Remove punctuation
  str_replace_all("\\s+", " ") |>                      # Collapse whitespace
  str_trim()
```

## Common problems and solutions {-}

**"Training failed: fileMapper" error**

The most common word2vec error. Cause: the text was passed as a single long string rather than a character vector of sentences. Fix: split into sentences first.

```r
library(stringr)

# Wrong
model <- word2vec(paste(texts, collapse = " "))

# Correct
sentences <- texts |>
  paste(collapse = " ") |>
  str_split("\\.\\s+") |>
  unlist()
model <- word2vec(sentences)
```

**Poor quality results**: increase corpus size; check preprocessing did not remove too much content; try more iterations; experiment with CBOW vs skip-gram.

**Out-of-vocabulary words**: lower `min_count`; use fastText (handles subwords); use a pre-trained model with larger vocabulary.

**Slow training**: reduce dimensions; use smaller window size; increase number of threads; for large corpora, consider the `text2vec` package (which implements GloVe rather than word2vec).

## Evaluating embeddings {-}

**Intrinsic evaluation** measures how well embeddings capture human semantic judgements:

```r
# Word similarity datasets: WordSim-353, SimLex-999
# test_pairs: data frame with columns word1, word2, human_score
evaluate_similarity <- function(model, test_pairs) {
  emb <- as.matrix(model)
  model_scores <- sapply(seq_len(nrow(test_pairs)), function(i) {
    pair <- c(test_pairs$word1[i], test_pairs$word2[i])
    if (!all(pair %in% rownames(emb))) return(NA_real_)  # out-of-vocabulary pair
    a <- emb[pair[1], ]
    b <- emb[pair[2], ]
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))       # cosine similarity
  })
  cor(model_scores, test_pairs$human_score, method = "spearman", use = "complete.obs")
}
```

**Extrinsic evaluation** measures performance on a downstream task (text classification, NER, sentiment analysis). This is generally more informative for applied research because it directly measures whether the embeddings are useful for your specific application.

## Document embeddings {-}

For document-level tasks (document similarity, clustering, classification), averaging word vectors is a simple and often effective approach. More principled alternatives include doc2vec (paragraph vectors) and sentence transformers.

```r
# Simple document embedding: average of word vectors
doc_to_vector <- function(text, embedding_matrix) {
  words <- tolower(unlist(strsplit(text, "\\s+")))
  words <- words[words %in% rownames(embedding_matrix)]
  if (length(words) == 0) return(NULL)
  colMeans(embedding_matrix[words, , drop = FALSE])
}

# For higher-quality sentence embeddings, use the `text` package with BERT
library(text)
sentence_embeddings <- textEmbed(
  texts = your_sentences,
  model = "sentence-transformers/all-MiniLM-L6-v2"
)
```

## Decision framework {-}

```
Do you have a domain-specific corpus that differs substantially from general English?
  Yes: Should you train your own model?
    Large corpus (10M+ words) → Train custom embeddings
    Small corpus → Use pre-trained embeddings + fine-tuning if needed
  No: Use pre-trained embeddings
    General English tasks → GloVe (simple, well-validated)
    Morphologically rich / rare words → fastText
    Context-dependent meaning → BERT / sentence transformers

Is your research question diachronic (meaning change over time)?
  → Train separate models on period-specific corpora
  → Use alignment methods (Procrustes) to make spaces comparable

Is your research question about bias or social meaning?
  → Pre-trained models on large general corpora (GloVe, word2vec Google News)
  → These are better snapshots of general language use than small custom corpora
```

## Reproducibility checklist {-}

```r
# Document everything so results can be reproduced
analysis_metadata <- list(
  date = Sys.Date(),
  corpus = "alice.txt + moby.txt + pride.txt",
  corpus_words = str_count(corpus_clean, "\\S+"),
  preprocessing = list(lowercase = TRUE, remove_punct = FALSE, min_count = 5),
  model_params = list(type = "skip-gram", dim = 100, window = 5, iter = 20),
  random_seed = 42,
  r_version = paste(R.version$major, R.version$minor, sep = ".")
)
saveRDS(analysis_metadata, "model_metadata.rds")

set.seed(42) # Always set before training
```

::: {.callout-tip}
## Exercises: Advanced topics
:::

**Q4.
A researcher wants to compare word meanings across three registers: academic writing, newspaper text, and Twitter posts. She plans to train a single word2vec model on a combined corpus of all three. A colleague suggests training three separate models instead. Which approach is better, and why?**

```{r}
#| echo: false
#| label: "advanced_q3"
check_question(
  "Training three separate models is better for this research question. A single combined model will produce average embeddings that blend the distributional patterns from all three registers, obscuring the register-specific differences the researcher wants to study. For example, 'sick' means something different in academic writing (illness) and Twitter (impressive/cool); averaging these distributions will produce a blurred representation that captures neither clearly. By training separate models and comparing the nearest neighbours of target words across the three models, the researcher can directly observe and quantify register-specific semantic differences. The challenge is that separate models have differently oriented vector spaces and cannot be directly compared; alignment methods such as Procrustes rotation are needed to map the spaces into a common coordinate system before comparison.",
  options = c(
    "Training a single combined model is better because it has more training data and will produce more reliable embeddings",
    "Both approaches give identical results because word2vec is insensitive to the source of training data",
    "Training three separate models is better for this research question. A single combined model will produce average embeddings that blend the distributional patterns from all three registers, obscuring the register-specific differences the researcher wants to study. For example, 'sick' means something different in academic writing (illness) and Twitter (impressive/cool); averaging these distributions will produce a blurred representation that captures neither clearly. By training separate models and comparing the nearest neighbours of target words across the three models, the researcher can directly observe and quantify register-specific semantic differences. The challenge is that separate models have differently oriented vector spaces and cannot be directly compared; alignment methods such as Procrustes rotation are needed to map the spaces into a common coordinate system before comparison.",
    "Pre-trained GloVe embeddings should be used instead because they were trained on a broader corpus"
  ),
  type = "radio",
  q_id = "advanced_q3",
  random_answer_order = FALSE,
  button_label = "Check answer",
  right = "Correct! The register comparison research question requires separate models precisely so that register-specific distributional patterns are not averaged away. The key methodological challenge is that separately trained models are not directly comparable because they are independently initialised and converge to differently oriented vector spaces. Procrustes alignment (orthogonal transformation of one model's space to best match the other's) is the standard solution for making separately trained embeddings comparable. The same technique is used in diachronic semantics research when comparing models trained on different time periods.",
  wrong = "Not quite. The research question here is specifically about how word meaning differs across registers, and the answer requires observing those differences, not averaging them out. A single combined model will blend the register-specific patterns and make them harder to detect. Training three separate models and comparing their nearest-neighbour sets for target words is the methodologically appropriate approach, though it requires an alignment step to make the three vector spaces comparable."
)
```

---

# Quick Reference {.unnumbered}

## Essential functions {-}

```{r ref-functions, eval=FALSE}
# Training
model <- word2vec(x = sentence_vector, type = "skip-gram", dim = 100,
                  window = 5, iter = 20, min_count = 5)

# Finding similar words
similar <- predict(model, "king", type = "nearest", top_n = 10)

# Get embedding matrix
embeddings <- as.matrix(model)

# Cosine similarity between two words
cosine_sim <- function(a, b, mat) {
  sum(mat[a,] * mat[b,]) / (sqrt(sum(mat[a,]^2)) * sqrt(sum(mat[b,]^2)))
}

# Save and load model
write.word2vec(model, "model.bin")
model <- read.word2vec("model.bin")

# t-SNE visualisation (set the seed before the call, not inside it)
set.seed(42)
tsne_result <- Rtsne(word_subset_matrix, dims = 2, perplexity = 30,
                     theta = 0.5, max_iter = 1000)
```

## Common workflows {-}

```r
# Basic similarity analysis
text |> preprocess() |> word2vec(dim = 100) -> model
predict(model, "target_word", type = "nearest", top_n = 10)

# Visualisation pipeline
embeddings <- as.matrix(model)
words_subset <- embeddings[selected_words, ]
set.seed(42)
tsne_result <- Rtsne(words_subset, dims = 2, perplexity = 10)
plot_data <- data.frame(word = selected_words,
                        x = tsne_result$Y[,1],
                        y = tsne_result$Y[,2])

# Word analogy
analogy_vector <- embeddings["king",] - embeddings["man",] + embeddings["woman",]
# Find nearest neighbour to analogy_vector in embedding space
```

---

# Final Project Ideas {.unnumbered}

::: {.callout-note}
## Capstone projects

Apply what you have learned with these research projects:

**1. Historical semantic change** Collect texts from different decades (e.g. from Project Gutenberg for historical periods; from newspaper archives for recent decades). Train separate embedding models. Track the nearest neighbours of target words over time. Visualise changes and identify the approximate date of semantic shifts.

**2. Domain-specific terminology extraction** Gather a specialised corpus (medical, legal, technical, or a specific academic field). Train custom embeddings.
Use the nearest neighbours of known domain terms as seeds to extract further domain-specific vocabulary. Compare to a general English model.

**3. Register comparison** Compare embeddings trained on different registers (formal vs. informal, spoken vs. written, academic vs. popular). Use Procrustes alignment to make the models comparable. Identify words whose meanings are most register-specific.

**4. Bias audit** Load pre-trained GloVe or word2vec embeddings. Define a gender or ethnicity direction vector. Project occupational terms onto this direction. Quantify associations and compare to historical data on occupational demographics. Replicate or extend the methodology of Garg et al. [-@garg2018word].

**5. Metaphor mapping** Identify a conceptual metaphor (e.g. ARGUMENT IS WAR, TIME IS MONEY, LIFE IS A JOURNEY). Define source and target domain vocabulary. Compute the cross-domain similarity matrix using your embeddings. Compare across languages or registers.

**Suggested deliverables:** A documented R script, at least two visualisations, and a brief report (1000–1500 words) interpreting and discussing your findings.
:::

---

# Citation & Session Info {.unnumbered}

::: {.callout-note}
## Citation

```{r citation-callout, echo=FALSE, results='asis'}
cat(
  params$author, ". ",
  params$year, ". *",
  params$title, "*. ",
  params$institution, ". ",
  "url: ", params$url, " ",
  "(Version ", params$version, "), ",
  "doi: ", params$doi, ".",
  sep = ""
)
```

```{r citation-bibtex, echo=FALSE, results='asis'}
key <- paste0(
  tolower(gsub(" ", "", gsub(",.*", "", params$author))),
  params$year,
  tolower(gsub("[^a-zA-Z]", "", strsplit(params$title, " ")[[1]][1]))
)
cat("```\n")
cat("@manual{", key, ",\n", sep = "")
cat("  author = {", params$author, "},\n", sep = "")
cat("  title = {", params$title, "},\n", sep = "")
cat("  year = {", params$year, "},\n", sep = "")
cat("  note = {", params$url, "},\n", sep = "")
cat("  organization = {", params$institution, "},\n", sep = "")
cat("  edition = {", params$version, "},\n", sep = "")
cat("  doi = {", params$doi, "}\n", sep = "")
cat("}\n```\n")
```
:::

::: {.callout-note}
## AI Transparency Statement

This tutorial was written with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. Claude was used to substantially revise and expand an earlier tutorial by Martin Schweinberger. All references were verified independently before BibTeX entries were generated. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.
:::

```{r session-info}
sessionInfo()
```

---

[Back to top](#intro)

[Back to LADAL home](/)

---

# References {.unnumbered}