This tutorial introduces computational lexicography with R and shows how to use R to create dictionaries, find synonyms, and generate bilingual translation lexicons through statistical analysis of corpus data. While the initial examples focus on English, subsequent sections demonstrate how the approach generalises to other languages — including German — using the udpipe package, which supports more than 60 languages.
Traditionally, dictionaries are listings of words arranged alphabetically, providing information on definitions, usage, etymologies, pronunciations, translations, and related forms (Agnes, Goldman, and Soltis 2002; Steiner 1985). Computational lexicology is the branch of computational linguistics concerned with the computer-based study of lexicons and machine-readable dictionaries (Amsler 1981). Computational lexicography, the focus of this tutorial, is the use of computers in the construction of dictionaries. Although the two terms are sometimes used interchangeably, the distinction between studying a lexicon and building one is conceptually important.
The tutorial is structured around three increasingly complex tasks: (1) generating a basic annotated dictionary from corpus text using part-of-speech tagging; (2) identifying synonym candidates using distributional semantics and cosine similarity; and (3) building a bilingual translation lexicon from parallel text using co-occurrence statistics.
Learning Objectives
By the end of this tutorial you will be able to:
Generate a basic annotated dictionary from corpus text using part-of-speech tagging with udpipe
Correct, extend, and enrich dictionary entries with additional layers of information (sentiment, comments)
Build a term-document matrix from corpus co-occurrence data
Compute Positive Pointwise Mutual Information (PPMI) and cosine similarity between items
Use hierarchical clustering to visualise semantic similarity among words
Extract synonym candidates automatically from a cosine similarity matrix
Create a bilingual translation lexicon from parallel text using contingency-based association measures
Apply the same workflow to languages other than English using multilingual udpipe models
Prerequisite Tutorials
Before working through this tutorial, we recommend familiarity with the following:
library(checkdown)  # interactive exercises
library(dplyr)      # data manipulation
library(stringr)    # string processing
library(udpipe)     # part-of-speech tagging (60+ languages)
library(tidytext)   # text mining and sentiment lexicons
library(tidyr)      # data reshaping
library(coop)       # cosine similarity
library(flextable)  # formatted tables
library(plyr)       # join operations for parallel data
Creating Dictionaries
Section Overview
What you will learn: How to use part-of-speech tagging to generate a structured dictionary from raw corpus text, and how to extend and enrich dictionary entries with sentiment information.
Key tools: udpipe for multilingual tagging, tidytext for sentiment lexicons, dplyr for table manipulation.
Loading and tagging the corpus text
In a first step, we load a text. We use George Orwell’s Nineteen Eighty-Four as the source text for our English dictionary.
Code
text <- readLines("tutorials/lexicography/data/orwell.txt") |>
  paste0(collapse = " ")
# show the first 500 characters of the text
substr(text, start = 1, stop = 500)
[1] "1984 George Orwell Part 1, Chapter 1 It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him. The hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It "
Next, we download a udpipe language model for English. The udpipe package supports more than 60 languages, making this approach directly transferable to other research contexts.
Code
# download English language model (run once, then load the model from disk as shown below)
m_eng <- udpipe::udpipe_download_model(language = "english-ewt")
Once downloaded, load the model directly from disk:
Code
# load language model from disk
m_eng <- udpipe_load_model(file = here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe"))
We now apply the part-of-speech tagger to the full text. udpipe_annotate() returns a data frame with one row per token, including token form, lemma, universal POS tag, and dependency information:
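The sketch below illustrates this step, assuming the model object m_eng and the text string loaded above (annotating a full novel can take a few minutes):

Code

# annotate the text: tokenisation, lemmatisation, and POS tagging
text_ann <- udpipe::udpipe_annotate(m_eng, x = text) |>
  as.data.frame()
# inspect the first rows of the columns we need
head(text_ann[, c("doc_id", "token_id", "token", "lemma", "upos", "xpos")], 10)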
doc_id token_id token lemma upos xpos
1 doc1 1 1984 1984 PROPN NNP
2 doc1 2 George George PROPN NNP
3 doc1 3 Orwell Orwell PROPN NNP
4 doc1 4 Part part PROPN NNP
5 doc1 5 1 1 NUM CD
6 doc1 6 , , PUNCT ,
7 doc1 7 Chapter chapter PROPN NNP
8 doc1 8 1 1 NUM CD
9 doc1 1 It it PRON PRP
10 doc1 2 was be AUX VBD
Generating the basic dictionary
We use the annotated data to generate a first, basic dictionary holding the word form (token), the part-of-speech tag (upos), the lemmatised word type (lemma), and the frequency with which that word form is used as that part-of-speech in the corpus. We begin by arranging entries by frequency, which is useful for spotting the most important vocabulary items quickly.
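The sketch below shows one way to build this table from the annotation data frame, assuming text_ann from the tagging step; the punctuation/number filter is one plausible choice rather than a fixed recipe:

Code

text_dict <- text_ann |>
  dplyr::as_tibble() |>
  # drop punctuation, numbers, and symbols
  dplyr::filter(!upos %in% c("PUNCT", "NUM", "SYM")) |>
  # count how often each token-lemma-POS combination occurs
  dplyr::count(token, lemma, upos, name = "frequency") |>
  dplyr::arrange(-frequency)
# inspect the ten most frequent entries
head(text_dict, 10)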
# A tibble: 10 × 4
token lemma upos frequency
<chr> <chr> <chr> <int>
1 the the DET 5249
2 of of ADP 2908
3 a a DET 2277
4 and and CCONJ 2064
5 was be AUX 1795
6 in in ADP 1446
7 to to PART 1336
8 it it PRON 1295
9 he he PRON 1270
10 had have AUX 1018
Dictionary conventions call for alphabetical ordering. We can switch to that with a single arrange() call:
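For example (assuming text_dict from the previous step):

Code

text_dict <- text_dict |>
  dplyr::arrange(token)
# inspect the first ten entries in alphabetical order
head(text_dict, 10)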
# A tibble: 10 × 4
token lemma upos frequency
<chr> <chr> <chr> <int>
1 A a DET 107
2 A a NOUN 1
3 AND and CCONJ 2
4 Aaronson Aaronson PROPN 8
5 About about ADV 4
6 Above above ADP 2
7 Abruptly abruptly ADV 2
8 Actually actually ADV 13
9 Adam Adam PROPN 1
10 Admission admission NOUN 1
Tagging Accuracy and Manual Post-Editing
POS tagging is not perfect — some tokens will receive incorrect tags and some lemmas will be wrong. Even state-of-the-art taggers reach around 95–97% accuracy on standard text, which means visible errors are inevitable at this scale. The resulting dictionary requires manual review before publication. However, the computational workflow dramatically reduces the effort needed to produce a first draft: instead of generating thousands of entries from scratch, the researcher begins with a near-complete list and corrects errors rather than creating every entry.
Correcting and extending dictionary entries
One of the advantages of keeping dictionaries in R as data frames is that entries are easy to correct and extend programmatically. Below we demonstrate removing a spurious entry, correcting a POS tag, and adding an annotation column with custom notes.
Code
text_dict_ext <- text_dict |>
  # remove spurious entry: 'a' tagged as NOUN
  dplyr::filter(!(lemma == "a" & upos == "NOUN")) |>
  # correct POS tag: 'aback' should be PREP, not NOUN
  dplyr::mutate(upos = ifelse(lemma == "aback" & upos == "NOUN", "PREP", upos)) |>
  # add custom comments
  dplyr::mutate(comment = dplyr::case_when(
    lemma == "a" ~ "also 'an' before vowels",
    lemma == "Aaronson" ~ "name of a character in the novel",
    TRUE ~ ""
  ))
# inspect
head(text_dict_ext, 10)
# A tibble: 10 × 5
token lemma upos frequency comment
<chr> <chr> <chr> <int> <chr>
1 A a DET 107 "also 'an' before vowels"
2 AND and CCONJ 2 ""
3 Aaronson Aaronson PROPN 8 "name of a character in the novel"
4 About about ADV 4 ""
5 Above above ADP 2 ""
6 Abruptly abruptly ADV 2 ""
7 Actually actually ADV 13 ""
8 Adam Adam PROPN 1 ""
9 Admission admission NOUN 1 ""
10 Africa Africa PROPN 10 ""
Adding sentiment information
To make the dictionary more informative, we enrich each entry with sentiment information from the tidytext package. We use the Bing Liu sentiment lexicon (Liu 2012), which classifies words as positive or negative.
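The sketch below shows the enrichment step, assuming text_dict_ext from above; the Bing lexicon ships with tidytext and is retrieved via get_sentiments("bing"):

Code

# fetch the Bing Liu sentiment lexicon (two columns: word, sentiment)
bing <- tidytext::get_sentiments("bing")
# join by lemma; entries without a match receive an empty string
text_dict_sent <- text_dict_ext |>
  dplyr::left_join(bing, by = c("lemma" = "word")) |>
  dplyr::mutate(sentiment = tidyr::replace_na(sentiment, "")) |>
  dplyr::select(token, lemma, upos, comment, sentiment)
# inspect
head(text_dict_sent, 10)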
# A tibble: 10 × 5
token lemma upos comment sentiment
<chr> <chr> <chr> <chr> <chr>
1 A a DET "also 'an' before vowels" ""
2 AND and CCONJ "" ""
3 Aaronson Aaronson PROPN "name of a character in the novel" ""
4 About about ADV "" ""
5 Above above ADP "" ""
6 Abruptly abruptly ADV "" "negative"
7 Actually actually ADV "" ""
8 Adam Adam PROPN "" ""
9 Admission admission NOUN "" ""
10 Africa Africa PROPN "" ""
The resulting extended dictionary now contains the token, lemma, POS tag, comment, and sentiment label — a richer lexical resource than the basic dictionary we started with, and one generated entirely automatically from corpus data.
Exercises: Creating Dictionaries
Q1. What is the difference between computational lexicology and computational lexicography?
Q2. After POS tagging, you notice that the word ‘run’ is sometimes tagged as VERB and sometimes as NOUN. Which dplyr approach is most appropriate to correct a specific erroneous tag?
Finding Synonyms: Creating a Thesaurus
Section Overview
What you will learn: How to use distributional semantics — co-occurrence statistics, PPMI weighting, and cosine similarity — to identify synonym candidates for a set of degree adverbs.
Why distributional methods? The basic assumption of distributional semantics is that words occurring in the same contexts tend to have similar meanings — the distributional hypothesis (Firth 1957). PPMI-weighted cosine similarity has been shown to outperform raw co-occurrence counts for semantic similarity tasks (Bullinaria and Levy 2007; Levshina 2015).
Another key task in lexicography is determining semantic relationships between words — in particular, whether two words are synonymous. In computational linguistics, such relationships are typically determined from collocational profiles, also called word vectors or word embeddings.
In this example, we investigate whether a set of degree adverbs (very, really, so, completely, totally, etc.) are synonymous — that is, whether they can be exchanged without substantially changing the meaning of the sentence. This is directly relevant to lexicography: if two adverbs have similar collocational profiles, a dictionary can link them as synonyms or near-synonyms.
Loading the degree adverb data
The dataset contains three columns: a pint column with the degree adverb, an adjs column with the adjective it modifies, and a remove column we do not need.
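The sketch below shows the loading and renaming step; the file name and path are placeholders, assuming a tab-separated file with the columns described above:

Code

deg_adv <- read.delim("tutorials/lexicography/data/degree_adverbs.txt", header = TRUE) |>
  # rename to transparent column names and drop the unused column
  dplyr::rename(degree_adverb = pint, adjective = adjs) |>
  dplyr::select(-remove)
# inspect
head(deg_adv, 10)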
degree_adverb adjective
1 real bad
2 really nice
3 very good
4 really early
5 really bad
6 really bad
7 so long
8 really wonderful
9 pretty good
10 really easy
Building the term-document matrix
We construct a term-document matrix (TDM) showing how often each degree adverb co-occurred with each adjective. Rows are adjectives; columns are degree adverbs; each cell contains the co-occurrence count.
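The sketch below shows the construction step, assuming deg_adv from above; xtabs() cross-tabulates adjectives against degree adverbs:

Code

# rows = adjectives, columns = degree adverbs, cells = co-occurrence counts
tdm <- xtabs(~ adjective + degree_adverb, data = deg_adv) |>
  as.matrix()
# inspect the top-left corner
tdm[1:5, 1:5]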
completely extremely pretty real really
able 0 1 0 0 0
actual 0 0 0 1 0
amazing 0 0 0 0 4
available 0 0 0 0 1
bad 0 0 1 2 3
Computing PPMI and cosine similarity
Raw co-occurrence counts are biased towards frequent words. Pointwise Mutual Information (PMI) corrects for this by comparing observed co-occurrence frequency to what would be expected if the two words were independent. Positive PMI (PPMI) replaces all negative PMI values with zero, which improves performance on semantic similarity tasks (Bullinaria and Levy 2007; Levshina 2015).
We then compute cosine similarity between the PPMI vectors of each degree adverb. Cosine similarity ranges from 0 (no shared context) to 1 (identical context profile).
Code
# compute expected values under independence
tdm.exp <- chisq.test(tdm)$expected
# calculate PMI and PPMI
PMI <- log2(tdm / tdm.exp)
PPMI <- ifelse(PMI < 0, 0, PMI)
# calculate cosine similarity between amplifier vectors
cosinesimilarity <- cosine(PPMI)
# inspect
cosinesimilarity[1:5, 1:5]
We convert the cosine similarity matrix to a distance matrix and apply Ward’s hierarchical clustering to visualise the similarity structure.
Code
# find maximum similarity value that is not 1 (self-similarity)
cosinesimilarity.test <- apply(cosinesimilarity, 1, function(x) {
  x <- ifelse(x == 1, 0, x)
})
maxval <- max(cosinesimilarity.test)
# convert similarity to distance
amplifier.dist <- 1 - (cosinesimilarity / maxval)
clustd <- as.dist(amplifier.dist)
Code
# hierarchical clustering with Ward's method
cd <- hclust(clustd, method = "ward.D")
# plot
plot(cd, main = "", sub = "", yaxt = "n", ylab = "", xlab = "", cex = .8)
The dendrogram reveals interpretable clusters. Completely, extremely, and totally form a cluster of strong, absolute intensifiers that are interchangeable with each other but not with milder adverbs. Real and really cluster together as colloquial variants. This structure matches what an experienced lexicographer would expect, and the method has recovered it automatically from corpus data.
Extracting synonym candidates
To extract synonyms automatically, we find the most similar adverb for each entry in the cosine similarity matrix: we replace diagonal values (each word’s perfect similarity to itself) with 0, then look up the column with the highest remaining value.
A Note on Syntactic Context
The synonym candidates here are based purely on collocational profile similarity. A complete synonym analysis would also consider syntactic context: very and so have similar profiles, but so is strongly disfavoured in attributive position (a so great tutorial is unusual, whereas a very great tutorial is fine). A full lexicographic treatment would require filtering by syntactic function before computing similarity.
Code
# build synonym table: replace self-similarity (1s) with 0
syntb <- cosinesimilarity |>
  as.data.frame() |>
  dplyr::mutate(word = colnames(cosinesimilarity)) |>
  dplyr::mutate(across(where(is.numeric), ~ replace(., . == 1, 0)))
# extract the most similar item for each word (search only the numeric columns)
syntb <- syntb |>
  dplyr::mutate(synonym = colnames(cosinesimilarity)[apply(dplyr::select(syntb, -word), 1, which.max)]) |>
  dplyr::select(word, synonym)
syntb
word synonym
completely completely extremely
extremely extremely completely
pretty pretty real
real real really
really really real
so so real
totally totally completely
very very so
The results confirm the clustering: completely and extremely are paired with each other, totally is paired with completely, and real and really point to each other — consistent with both prior expectations and the dendrogram above.
For further reading on semantic vector space modelling, Rajeg, Denistia, and Musgrave (2019) provide an accessible introduction, and Levshina (2015) offers a comprehensive treatment of distributional methods for corpus linguists.
Exercises: Finding Synonyms
Q1. Why is Positive PMI (PPMI) preferred over raw PMI for computing semantic similarity?
Q2. In the dendrogram, completely, extremely, and totally form a tight cluster. What does this tell us lexicographically?
Creating Bilingual Dictionaries
Section Overview
What you will learn: How to generate a bilingual translation lexicon from parallel text using word co-occurrence statistics and contingency-based association measures.
Why this matters: Data-driven translation lexicons can be generated for any language pair for which parallel data exists — including low-resource languages where commercial dictionaries are unavailable.
Translation dictionaries map words in one language to their counterparts in another. If a German word and an English word tend to co-occur across sentence-translation pairs, they are likely translations of each other. The quality of the result depends on the quantity and alignment quality of the parallel data, and grammatical differences between languages introduce additional challenges.
Loading parallel text
We load a sample of German sentences and their English translations. Each line contains a German sentence and its English translation, separated by the string — (a spaced em dash).
Wie lange lebst du schon in Brisbane? — How long have you been living in Brisbane?
Leben Sie schon lange hier? — Have you been living here for long?
Welcher Bus geht nach Brisbane? — Which bus goes to Brisbane?
Von welchem Gleis aus fährt der Zug? — Which platform is the train leaving from?
Ist dies der Bus nach Toowong? — Is this the bus going to Toowong?
Separating German and English sentences
We split the parallel data into two tables — one for German, one for English — each indexed by sentence number. The sentence index preserves the alignment between source and target sentences.
Code
# separate German and English, remove punctuation
german <- stringr::str_remove_all(translations, " [-\u2014\u2013] .*") |>
  stringr::str_remove_all("[[:punct:]]")
english <- stringr::str_remove_all(translations, ".* [-\u2014\u2013] ") |>
  stringr::str_remove_all("[[:punct:]]")
sentence <- 1:length(german)
germantb <- data.frame(sentence, german)
englishtb <- data.frame(sentence, english)
sentence  german
1         Guten Tag
2         Guten Morgen
3         Guten Abend
4         Hallo
5         Wo kommst du her
6         Woher kommen Sie
7         Ich bin aus Hamburg
8         Ich komme aus Hamburg
9         Ich bin Deutscher
10        Schön Sie zu treffen
11        Wie lange lebst du schon in Brisbane
12        Leben Sie schon lange hier
13        Welcher Bus geht nach Brisbane
14        Von welchem Gleis aus fährt der Zug
15        Ist dies der Bus nach Toowong
Creating word-level co-occurrence pairs
We tokenise the sentences into individual words and cross-join German and English tokens within each sentence. Each row of the result represents a German–English word pair that co-occurred in the same sentence translation unit.
Code
# tokenise German sentences
german_tokens <- germantb |>
  tidytext::unnest_tokens(word, german)
# join English sentences by sentence id, then tokenise English
transtb <- german_tokens |>
  dplyr::left_join(englishtb, by = "sentence") |>
  tidytext::unnest_tokens(trans, english) |>
  dplyr::rename(german = word, english = trans) |>
  dplyr::select(german, english) |>
  dplyr::mutate(german = factor(german), english = factor(english))
german  english
guten   good
guten   day
tag     good
tag     day
guten   good
guten   morning
morgen  good
morgen  morning
guten   good
guten   evening
abend   good
abend   evening
hallo   hello
wo      where
wo      are
Building the co-occurrence matrix
From the word-pair table we construct a co-occurrence matrix: rows are German words, columns are English words, and each cell is the count of how many times that German–English pair appeared in the same sentence pair.
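The sketch below shows this step, assuming transtb from the tokenisation step; table() counts how often each German–English word pair co-occurs:

Code

# rows = German words, columns = English words, cells = co-occurrence counts
coocc <- table(transtb$german, transtb$english) |>
  as.matrix()
# inspect the top-left corner
coocc[1:10, 1:10]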
a accident all am ambulance an and any anything are
ab 0 0 0 0 0 0 0 0 0 0
abend 0 0 0 0 0 0 0 0 0 0
allem 0 0 0 0 0 0 0 0 0 0
alles 0 0 1 0 0 0 0 0 0 0
am 0 0 0 0 0 0 0 0 0 0
an 0 0 0 0 0 0 0 0 0 0
anderen 1 0 0 0 0 0 0 0 0 0
apotheke 1 0 0 1 0 0 0 0 0 0
arzt 1 0 0 0 0 0 0 0 0 0
auch 3 0 0 0 0 0 0 0 1 0
Computing association strength
We measure the statistical association between each German–English word pair using Fisher’s Exact Test and the phi coefficient (φ), both of which take marginal frequencies into account — the same approach used in keyword analysis and collocation research. For each pair, we compute both measures, retain only pairs whose observed co-occurrence exceeds the expected frequency (genuine positive associations), and rank the result by phi.
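The sketch below shows one way to implement this, assuming the co-occurrence matrix coocc from the previous step; the 2×2 contingency table for each pair is built from the pair count and the marginal word frequencies:

Code

total        <- sum(coocc)
german_freq  <- rowSums(coocc)   # marginal frequency of each German word
english_freq <- colSums(coocc)   # marginal frequency of each English word
# indices of all attested German-English pairs
pairs <- which(coocc > 0, arr.ind = TRUE)
assoc <- lapply(seq_len(nrow(pairs)), function(i) {
  g <- pairs[i, 1]
  e <- pairs[i, 2]
  gf <- unname(german_freq[g])    # frequency of the German word
  ef <- unname(english_freq[e])   # frequency of the English word
  both    <- coocc[g, e]                      # pair co-occurs
  g_only  <- gf - both                        # German word without the English word
  e_only  <- ef - both                        # English word without the German word
  neither <- total - both - g_only - e_only   # neither word occurs
  tab <- matrix(c(both, g_only, e_only, neither), nrow = 2)
  data.frame(
    german   = rownames(coocc)[g],
    english  = colnames(coocc)[e],
    observed = both,
    expected = gf * ef / total,
    phi      = (both * neither - g_only * e_only) /
               sqrt((both + g_only) * (e_only + neither) *
                    (both + e_only) * (g_only + neither)),
    p        = fisher.test(tab)$p.value
  )
})
assoc <- do.call(rbind, assoc) |>
  dplyr::filter(observed > expected) |>   # keep genuine positive associations
  dplyr::arrange(-phi)
# inspect the ten strongest translation candidates
head(assoc, 10)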
The results show that even a small parallel corpus yields reasonable translation candidates. The top-ranked pairs align well with genuine translation equivalents. Mismatches further down the ranking illustrate the need for more data to disambiguate polysemous words and handle idiomatic expressions. The approach scales directly: with a larger parallel corpus, accuracy improves substantially.
Exercises: Bilingual Dictionaries
Q1. Why is raw co-occurrence count insufficient for identifying translation equivalents, and what statistical measure does this tutorial use instead?
Generating Dictionaries for Other Languages
Section Overview
What you will learn: How to apply the same dictionary-generation pipeline to a language other than English, using German as a demonstration.
Key point: Because udpipe supports more than 60 languages, the workflow transfers directly to any supported language by simply changing the model file.
The procedure for generating dictionaries can easily be applied to languages other than English. The only change required is the udpipe language model. Here we demonstrate using a sample of the Brothers Grimm fairy tales as a German-language corpus.
Loading a German corpus
Code
grimm <- readLines("tutorials/lexicography/data/GrimmsFairytales.txt", encoding = "latin1") |>
  paste0(collapse = " ")
# show the first 200 characters
substr(grimm, start = 1, stop = 200)
[1] "Der Froschkönig oder der eiserne Heinrich Ein Märchen der Brüder Grimm Brüder Grimm In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön; aber die"
Downloading and loading a German model
Code
# download German model (run once); "german-gsd" matches the model file loaded below
udpipe::udpipe_download_model(language = "german-gsd")
Code
# load German model from disk
m_ger <- udpipe_load_model(file = here::here("udpipemodels", "german-gsd-ud-2.5-191206.udpipe"))
Generating the German dictionary
The tagging, filtering, and summarising steps are identical to the English pipeline — only the model and input text change:
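A sketch of these steps, assuming m_ger and grimm from above and mirroring the English pipeline (the punctuation/number filter is again one plausible choice):

Code

# annotate the German text
grimm_ann <- udpipe::udpipe_annotate(m_ger, x = grimm) |>
  as.data.frame()
# build the dictionary: filter, count, and sort alphabetically
german_dict <- grimm_ann |>
  dplyr::as_tibble() |>
  dplyr::filter(!upos %in% c("PUNCT", "NUM", "SYM")) |>
  dplyr::count(token, lemma, upos, name = "frequency") |>
  dplyr::arrange(token)
# inspect
head(german_dict, 10)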
The result is a German dictionary derived from the Grimm fairy tales, holding the word form, POS tag, lemma, and frequency — the same structure as the English dictionary. The same enrichment steps (adding sentiment, comments, translations) can be applied directly.
Going Further: Crowd-Sourced Dictionaries
Section Overview
What you will learn: How the dictionary-generation approach described in this tutorial can be extended to collaborative, crowd-sourced dictionary projects using Git and GitHub.
The dictionary-generation workflow presented in this tutorial can be extended to crowd-sourced dictionary projects. By hosting the dictionary in a Git repository on GitHub or GitLab, you can allow any researcher with an account to contribute entries or corrections.
Contributors fork the repository, make their additions or corrections, and submit a pull request. The repository owner reviews each proposed change and decides whether to accept it — maintaining quality control while enabling distributed contribution. Because Git is a version control system, any erroneously accepted change can be reverted instantly.
This is particularly well-suited to the computational lexicography workflow presented here. The R-generated dictionary provides an accurate, automatically produced starting point; the crowd-sourcing layer adds human expert review, corrections, and extensions that automated methods cannot provide. RStudio’s built-in Git integration makes this pipeline accessible without command-line expertise — see Happy Git and GitHub for the useR for a practical guide.
Citation & Session Info
Citation
Martin Schweinberger. 2026. Lexicography with R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/lexicography/lexicography.html (Version 3.1.1).
@manual{martinschweinberger2026lexicography,
author = {Martin Schweinberger},
title = {Lexicography with R},
year = {2026},
note = {https://ladal.edu.au/tutorials/lexicography/lexicography.html},
organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
edition = {3.1.1},
doi = {10.5281/zenodo.19332901}
}
This tutorial was revised and expanded with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to fix two deprecated function calls (mutate_each() replaced with mutate(across(...)) and str_remove_all(., "[:punct:]") corrected to str_remove_all("[[:punct:]]")), rewrite . placeholder usage for compatibility with the native |> pipe (including removing the plyr::join(., ...) call by replacing it with a two-step left_join), move library(plyr) to the setup chunk, add Learning Objectives and Prerequisite callouts, replace <div class="warning"> and <div class="question"> HTML blocks with Quarto callouts, add section overview callouts, add six checkdown exercises, expand and clarify the prose explanations throughout, standardise chunk labels, fix the BibTeX comma bug, and align the document style with other LADAL tutorials. The YAML header and all content after the Citation heading were left unchanged. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for the tutorial’s accuracy and pedagogical appropriateness.
Agnes, Michael, Jonathan L Goldman, and Katherine Soltis. 2002. Webster’s New World Compact Desk Dictionary and Style Guide. Hungry Minds.
Amsler, Robert Alfred. 1981. The Structure of the Merriam-Webster Pocket Dictionary. Austin, TX: The University of Texas at Austin.
Bullinaria, J. A., and J. P. Levy. 2007. “Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study.” Behavior Research Methods 39: 510–26. https://doi.org/10.3758/bf03193020.
Firth, John R. 1957. “A Synopsis of Linguistic Theory, 1930–1955.” In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.
Rajeg, Gede Primahadi Wijaya, Karlina Denistia, and Simon Musgrave. 2019. “R Markdown Notebook for Vector Space Model and the Usage Patterns of Indonesian Denominal Verbs.” https://doi.org/10.6084/m9.figshare.9970205.v1.
Steiner, Roger J. 1985. “Dictionaries. The Art and Craft of Lexicography.” Dictionaries: Journal of the Dictionary Society of North America 7 (1): 294–300. https://doi.org/10.2307/3735704.