# Introduction

This tutorial introduces collocation and co-occurrence analysis with R and shows how to extract and visualize semantic links between words.

This tutorial is aimed at beginners and intermediate users of R and showcases how to extract and analyze collocations and N-grams from textual data using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with collocation analysis.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.

Click this link to open an interactive version of this tutorial on MyBinder.org.
This interactive Jupyter notebook allows you to execute code yourself and you can also change and edit the notebook, e.g. you can change code and upload your own data.

Parts of this tutorial build on and use materials from this tutorial on co-occurrence analysis with R by Andreas Niekler and Gregor Wiedemann.

How can you determine if words occur more frequently together than would be expected by chance?

This tutorial aims to show how you can answer this question.

So, how would you find words that are associated with a specific term and how can you visualize such word nets? This tutorial focuses on co-occurrence and collocations of words. Collocations are words that occur very frequently together. For example, Merry Christmas is a collocation because merry and Christmas occur more frequently together than would be expected by chance. This means that if you were to shuffle all words in a corpus and then check how often merry and Christmas co-occurred, they would occur significantly less often in the shuffled or randomized corpus than in a corpus that contains non-shuffled natural speech.
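To make this idea concrete, here is a minimal sketch of such a shuffling (permutation) test in R. The toy text and the counting helper are made up for illustration and are not part of the data used later in this tutorial.

# minimal permutation-test sketch (illustrative toy data)
set.seed(42)
text <- c("we", "wish", "you", "a", "merry", "christmas", "we", "wish",
          "you", "a", "merry", "christmas", "and", "a", "happy", "new", "year")
# helper: count how often "merry" is directly followed by "christmas"
count_bigram <- function(tokens) {
  sum(tokens[-length(tokens)] == "merry" & tokens[-1] == "christmas")
}
observed <- count_bigram(text)
# bigram frequency in 1000 randomly shuffled versions of the text
shuffled <- replicate(1000, count_bigram(sample(text)))
# proportion of shuffles with at least as many co-occurrences as observed
mean(shuffled >= observed)

If this proportion is very small, the bigram occurs more often in the natural text than chance reordering would predict, which is exactly the intuition behind collocation statistics.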

## Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below execute without errors. Before turning to the code, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries), so you do not need to worry if it takes a while.

# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print=1000)
# install packages
install.packages("FactoMineR")
install.packages("factoextra")
install.packages("flextable")
install.packages("GGally")
install.packages("ggdendro")
install.packages("igraph")
install.packages("network")
install.packages("Matrix")
install.packages("quanteda")
install.packages("quanteda.textstats")
install.packages("dplyr")
install.packages("stringr")
install.packages("tm")
install.packages("sna")
# the following call installs a package from a local source file;
# the file name below is a placeholder - see the note on installing the
# collostructions package in the section on Collostructional Analysis
# install.packages("collostructions_0.2.0.tar.gz", repos = NULL, type = "source")
install.packages("tidytext")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

# load packages
library(FactoMineR)
library(factoextra)
library(flextable)
library(GGally)
library(ggdendro)
library(igraph)
library(network)
library(Matrix)
library(quanteda)
library(dplyr)
library(stringr)
library(tm)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R, RStudio, and once you have initiated the session by executing the code shown above, you are good to go.

# 1 Extracting N-Grams and Collocations

Collocations are terms that co-occur (significantly) more often together than would be expected by chance. A typical example of a collocation is Merry Christmas because the words merry and Christmas occur together more frequently than would be expected if words were just randomly strung together.

N-grams are related to collocates in that they represent words that occur together (bi-grams are two words that occur together, tri-grams three words, and so on). Fortunately, creating N-gram lists is very easy. We will use Charles Darwin's On the Origin of Species by Means of Natural Selection as a data source and begin by generating a bi-gram list. As a first step, we load the data and split it into individual words.

# read in text (the URL is an assumption - adjust the path to where you store the text)
darwin <- base::readLines("https://slcladal.github.io/data/origindarwin.txt") %>%
  paste0(collapse = " ") %>%
  stringr::str_squish() %>%
  stringr::str_remove_all("- ")
# further processing
darwin_split <- darwin %>%
as_tibble() %>%
tidytext::unnest_tokens(words, value)
| words |
|:--|
| the |
| origin |
| of |
| species |
| by |
| charles |
| darwin |
| an |
| historical |
| sketch |
| of |
| the |
| progress |
| of |
| opinion |

We can create bi-grams (N-grams consisting of two elements) by pasting every word together with the word that immediately follows it. To do this, we could use a function that is already available, but we can also very simply do this manually. The first step in creating bigrams manually consists in creating a table which holds every word in a corpus and the word that immediately follows it.

# create data frame
darwin_words <- darwin_split %>%
dplyr::rename(word1 = words) %>%
dplyr::mutate(word2 = c(word1[2:length(word1)], NA)) %>%
na.omit()
| word1 | word2 |
|:--|:--|
| the | origin |
| origin | of |
| of | species |
| species | by |
| by | charles |
| charles | darwin |
| darwin | an |
| an | historical |
| historical | sketch |
| sketch | of |
| of | the |
| the | progress |
| progress | of |
| of | opinion |
| opinion | on |

We can then paste the elements in these two columns together and also inspect the frequency of each bigram.

darwin2grams <- darwin_words %>%
dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
dplyr::group_by(bigram) %>%
dplyr::summarise(frequency = n()) %>%
dplyr::arrange(-frequency)
| bigram | frequency |
|:--|--:|
| of the | 2,673 |
| in the | 1,440 |
| the same | 959 |
| to the | 790 |
| on the | 743 |
| have been | 624 |
| that the | 574 |
| it is | 500 |
| natural selection | 405 |
| and the | 350 |
| from the | 346 |
| in a | 339 |
| of a | 337 |
| with the | 336 |
| to be | 324 |

The table shows that among the 15 most frequent bigrams, there is only a single bigram (natural selection) which does not involve stop words (words that do not have referential but only relational meaning).

We can remove bigrams that include stop words rather easily, as shown below.

# define stopwords
stps <- paste0(tm::stopwords(kind = "en"), collapse = "\\b|\\b")
# clean bigram table
darwin2grams_clean <- darwin2grams %>%
dplyr::filter(!str_detect(bigram, stps))
| bigram | frequency |
|:--|--:|
| natural selection | 405 |
| organic beings | 107 |
| distinct species | 106 |
| one species | 74 |
| closely allied | 73 |
| widely different | 52 |
| allied species | 48 |
| can understand | 47 |
| can see | 44 |
| new species | 44 |
| south america | 44 |
| modified descendants | 43 |
| physical conditions | 42 |
| fresh water | 40 |
| will generally | 38 |

## Extracting N-Grams with quanteda

The quanteda package (see Benoit et al. 2018) offers excellent and very fast functions for extracting bigrams.

#clean corpus
darwin_clean <- darwin %>%
stringr::str_to_title()
# tokenize corpus
darwin_tokzd <- quanteda::tokens(darwin_clean)
# extract bigrams
BiGrams <- darwin_tokzd %>%
quanteda::tokens_remove(stopwords("en")) %>%
quanteda::tokens_select(pattern = "^[A-Z]",
                        valuetype = "regex",
                        case_insensitive = FALSE,
                        # keep placeholders for removed tokens so that
                        # non-adjacent words are not treated as adjacent
                        padding = TRUE) %>%
quanteda.textstats::textstat_collocations(min_count = 5, tolower = FALSE)
| collocation | count | count_nested | length | lambda | z |
|:--|--:|--:|--:|--:|--:|
| Natural Selection | 405 | 0 | 2 | 8.042955 | 59.11301 |
| Conditions Life | 119 | 0 | 2 | 5.941802 | 43.44276 |
| Organic Beings | 107 | 0 | 2 | 8.495800 | 38.54199 |
| Closely Allied | 64 | 0 | 2 | 6.796989 | 35.14891 |
| South America | 44 | 0 | 2 | 7.632984 | 30.34581 |
| Widely Different | 51 | 0 | 2 | 5.472490 | 29.73579 |
| Modified Descendants | 41 | 0 | 2 | 6.107450 | 28.83663 |
| Distinct Species | 105 | 0 | 2 | 3.446325 | 28.82699 |
| State Nature | 45 | 0 | 2 | 5.390340 | 28.29911 |
| Theory Natural | 52 | 0 | 2 | 4.820828 | 27.64616 |
| Individual Differences | 35 | 0 | 2 | 6.090381 | 27.36223 |
| North America | 31 | 0 | 2 | 7.164715 | 26.78067 |
| Reason Believe | 34 | 0 | 2 | 6.407312 | 26.77187 |
| Forms Life | 57 | 0 | 2 | 3.987977 | 26.18518 |
| Throughout World | 30 | 0 | 2 | 6.418136 | 26.02976 |

We can also compound bigrams very easily using the tokens_compound function, which joins the two elements of each bigram into a single token.

ngram_extract <- quanteda::tokens_compound(darwin_tokzd, pattern = BiGrams)

We can now generate concordances (and clean the resulting kwic table - the keyword-in-context table).

ngram_kwic <- kwic(ngram_extract, pattern = c("Natural_Selection", "South_America")) %>%
as.data.frame() %>%
dplyr::select(-to, -from, -pattern)
| docname | pre | keyword | post |
|:--|:--|:--|:--|
| text1 | Distribution Of The Organic_Beings Inhabiting | South_America | , And In The Geological |
| text1 | We Shall Then See How | Natural_Selection | Almost_Inevitably Causes Much_Extinction Of The |
| text1 | , I Am Convinced That | Natural_Selection | Has Been The Most Important |
| text1 | By A Process Of " | Natural_Selection | , " As Will Hereafter |
| text1 | They Thus Aft'ord Materials For | Natural_Selection | To Act On And Accumulate |
| text1 | On And Rendered Definite By | Natural_Selection | , As Hereafter To Be |
| text1 | To The Cumulative Action Of | Natural_Selection | , Hereafter To Be Explained |
| text1 | For Existence Its Bearing On | Natural_Selection | - The Term Used In |
| text1 | Struggle For Existence Bears On | Natural_Selection | . It Has Been Seen |
| text1 | Preserved , By The Term | Natural_Selection | , In Order To Mark |
| text1 | Hand Of Nature . But | Natural_Selection | , As We Shall_Hereafter_See , |
| text1 | Slow-Breeding Cattle And Horses In | South_America | , And Latterly In Australia |
| text1 | To The Feral Animals Of | South_America | . Here I Will Make |
| text1 | Have Observeu In Parts Of | South_America | ) The Vegetation : This |
| text1 | And Multiply . Chapter Iv | Natural_Selection | ; Or The Survival Of |

The disadvantage here is that, strictly speaking, we are only extracting N-grams and not collocates, as collocates do not necessarily occur in direct adjacency. The following section shows how to expand the extraction of N-grams to the extraction of collocates.

# 2 Finding Collocations

N-grams and collocations are not only important concepts in language teaching; they are also fundamental in Text Analysis and many other research areas working with language data. Unfortunately, words that collocate do not have to be immediately adjacent but can also be several slots apart. This is unfortunate because it makes retrieval of collocates substantially more difficult compared with a situation in which we only need to extract words that occur right next to each other.
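Before turning to the approach used in this tutorial, here is a brief sketch of one way to capture such non-adjacent co-occurrence: counting words that appear within a window around each other using quanteda's fcm function. The example sentences and the window size of 5 are made up for illustration.

# sketch: window-based co-occurrence counts with quanteda (toy example)
toy <- c("natural selection acts on slight variations",
         "variations that are useful are preserved by natural selection")
toy_tokens <- quanteda::tokens(toy)
# count co-occurrences within a window of 5 tokens rather than direct adjacency
toy_fcm <- quanteda::fcm(toy_tokens, context = "window", window = 5)
toy_fcm

The approach below achieves something similar by treating each sentence as the co-occurrence window.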

In the following, we will extract collocations from Charles Darwin’s On the Origin of Species by Means of Natural Selection. In a first step, we will split the Origin into individual sentences.

# read in and process text
darwinsentences <- darwin %>%
stringr::str_squish() %>%
tokenizers::tokenize_sentences(.) %>%
unlist() %>%
stringr::str_remove_all("- ") %>%
stringr::str_replace_all("\\W", " ") %>%
stringr::str_squish()
# inspect data
head(darwinsentences)
## [1] "THE ORIGIN OF SPECIES BY CHARLES DARWIN AN HISTORICAL SKETCH OF THE PROGRESS OF OPINION ON THE ORIGIN OF SPECIES INTRODUCTION When on board H M S"
## [2] "Beagle as naturalist I was much struck with certain facts in the distribution of the organic beings inhabiting South America and in the geological relations of the present to the past inhabitants of that continent"
## [3] "These facts as will be seen in the latter chapters of this volume seemed to throw some light on the origin of species that mystery of mysteries as it has been called by one of our greatest philosophers"
## [4] "On my return home it occurred to me in 1837 that something might perhaps be made out on this question by patiently accumulating and reflecting on all sorts of facts which could possibly have any bearing on it"
## [5] "After five years work I allowed myself to speculate on the subject and drew up some short notes these I enlarged in 1844 into a sketch of the conclusions which then seemed to me probable from that period to the present day I have steadily pursued the same object"
## [6] "I hope that I may be excused for entering on these personal details as I give them to show that I have not been hasty in coming to a decision"

The first element does not represent a full sentence because we selected a sample of the text which began in the middle of a sentence rather than at its beginning. In a next step, we will create a matrix that shows how often each word co-occurred with each other word in the data.
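The key trick in the code below is that multiplying the transposed document-term matrix with itself yields term-by-term co-occurrence counts. A minimal toy illustration with a made-up binary sentence-term matrix shows the principle:

# toy illustration: t(X) %*% X turns a binary document-term matrix
# into a term-by-term co-occurrence matrix (made-up mini corpus)
X <- matrix(c(1, 1, 0,   # sentence 1 contains "natural" and "selection"
              1, 1, 1,   # sentence 2 contains all three terms
              0, 0, 1),  # sentence 3 contains only "species"
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("natural", "selection", "species")))
t(X) %*% X
# the diagonal holds the number of sentences a term occurs in,
# off-diagonal cells hold sentence-level co-occurrence counts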

# convert into corpus
darwincorpus <- Corpus(VectorSource(darwinsentences))
# create vector with words to remove
extrawords <- c("the", "can", "get", "got", "can", "one",
"dont", "even", "may", "but", "will",
"much", "first", "but", "see", "new",
"many", "less", "now", "well", "like",
"often", "every", "said", "two")
# clean corpus
darwincorpusclean <- darwincorpus %>%
  tm::tm_map(removePunctuation) %>%
  tm::tm_map(removeNumbers) %>%
  tm::tm_map(content_transformer(tolower)) %>%
  tm::tm_map(removeWords, stopwords()) %>%
  tm::tm_map(removeWords, extrawords)
# create document term matrix
darwindtm <- DocumentTermMatrix(darwincorpusclean, control=list(bounds = list(global=c(1, Inf)), weighting = weightBin))

# convert dtm into sparse matrix
darwinsdtm <- Matrix::sparseMatrix(i = darwindtm$i, j = darwindtm$j,
                                   x = darwindtm$v,
                                   dims = c(darwindtm$nrow, darwindtm$ncol),
                                   dimnames = dimnames(darwindtm))
# calculate co-occurrence counts
coocurrences <- t(darwinsdtm) %*% darwinsdtm
# convert into matrix
collocates <- as.matrix(coocurrences)

| Word | board | charles | darwin | historical | introduction | opinion | origin | progress | sketch | species |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| board | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| charles | 2 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| darwin | 1 | 1 | 3 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |
| historical | 1 | 1 | 1 | 4 | 1 | 1 | 2 | 1 | 1 | 2 |
| introduction | 1 | 1 | 1 | 1 | 7 | 1 | 2 | 1 | 1 | 4 |
| opinion | 1 | 1 | 1 | 1 | 1 | 10 | 1 | 1 | 1 | 6 |
| origin | 1 | 1 | 2 | 2 | 2 | 1 | 55 | 1 | 1 | 23 |
| progress | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 19 | 1 | 6 |
| sketch | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 |
| species | 1 | 2 | 1 | 2 | 4 | 6 | 23 | 6 | 1 | 1,292 |

We can inspect this co-occurrence matrix and check how many terms (words or elements) it represents using the ncol function from base R. We can also check how often terms co-occur in the data using the summary function from base R. The output of the summary function tells us that the minimum co-occurrence frequency of a word in the data is 1 with a maximum of 24,489. The difference between the median (44.0) and the mean (204.3) indicates that the frequencies are distributed very non-normally - which is common for language data.

# inspect size of matrix
ncol(collocates)
## [1] 8638
summary(rowSums(collocates))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0    22.0    44.0   204.3   137.0 24489.0

The ncol function reports that the data represents 8,638 words, and the most frequent word co-occurs with other words 24,489 times in the text.

# 3 Visualizing Collocations

We will now use an example of one individual word (selection) to show how collocation strength for individual terms is calculated and how it can be visualized. The function calculateCoocStatistics is taken from this tutorial (see also Wiedemann and Niekler 2017).

# load function for co-occurrence calculation
source("https://slcladal.github.io/rscripts/calculateCoocStatistics.R")
# define term
coocTerm <- "selection"
# calculate co-occurrence statistics
coocs <- calculateCoocStatistics(coocTerm, darwinsdtm, measure = "LOGLIK")
# inspect results
coocs[1:20]
##      natural       theory   variations      effects modifications
##   1766.44751    127.86330    124.94947     69.94048     63.52192
##         acts        power       slight    advantage       disuse
##     53.15165     53.14602     48.94158     48.21207     47.84167
##  accumulated       sexual    variation    principle       useful
##     46.99429     44.31103     43.64372     42.17271     39.36931
## preservation     survival    structure       action   favourable
##     37.35337     36.59873     36.37859     35.18738     35.06886

The output shows that the word most strongly associated with selection in Charles Darwin's Origin is - unsurprisingly - natural. Given the substantive strength of the association between natural and selection, these terms are definitely collocates and almost - if not already - a lexicalized construction (at least in this text).

There are various visualization options for collocations. Which visualization method is appropriate depends on what the visualizations should display.

## Association Strength

We start with the most basic option and visualize the collocation strength using a simple dot chart. We use the vector of association strengths generated above and transform it into a table. Also, we exclude elements with an association strength lower than 30.
coocdf <- coocs %>%
  as.data.frame() %>%
  dplyr::mutate(CollStrength = coocs,
                Term = names(coocs)) %>%
  dplyr::filter(CollStrength > 30)

| Term | CollStrength |
|:--|--:|
| natural | 1,766.44751 |
| theory | 127.86330 |
| variations | 124.94947 |
| effects | 69.94048 |
| modifications | 63.52192 |
| acts | 53.15165 |
| power | 53.14602 |
| slight | 48.94158 |
| advantage | 48.21207 |
| disuse | 47.84167 |
| accumulated | 46.99429 |
| sexual | 44.31103 |
| variation | 43.64372 |
| principle | 42.17271 |
| useful | 39.36931 |

We can now visualize the association strengths as shown in the code chunk below.

ggplot(coocdf, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
  geom_point() +
  coord_flip() +
  theme_bw() +
  labs(y = "")

The dot chart shows that natural collocates much more strongly with selection than any other term. This confirms that natural and selection form a collocation in Darwin's Origin.

## Dendrograms

Another method for visualizing collocations are dendrograms. Dendrograms (also called tree diagrams) show how similar elements are based on one or more features. As such, dendrograms are used to indicate groupings as they show elements (words) that are notably similar or different with respect to their association strength. To use this method, we first need to generate a distance matrix from our co-occurrence matrix.

coolocs <- c(coocdf$Term, "selection")
# remove non-collocating terms
collocates_redux <- collocates[rownames(collocates) %in% coolocs, ]
collocates_redux <- collocates_redux[, colnames(collocates_redux) %in% coolocs]
# create distance matrix
distmtx <- dist(collocates_redux)

clustertexts <- hclust(    # hierarchical cluster object
  distmtx,                 # use distance matrix as data
  method = "ward.D2")      # ward.D2 as linkage method

ggdendrogram(clustertexts) +
ggtitle("Terms strongly collocating with *selection*")

## Network Graphs

Network graphs are a very useful tool to show relationships (or the absence of relationships) between elements. Network graphs are highly useful when it comes to displaying the relationships that words have among each other and which properties these networks of words have.

### Basic Network Graphs

In order to display a network, we first need to create a network object using the network function from the network package.

net = network::network(collocates_redux,
directed = FALSE,
ignore.eval = FALSE,
names.eval = "weights")
# vertex names
network.vertex.names(net) = rownames(collocates_redux)
# inspect object
net
##  Network attributes:
##   vertices = 26
##   directed = FALSE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##   total edges= 265
##     missing edges= 0
##     non-missing edges= 265
##
##  Vertex attribute names:
##     vertex.names
##
##  Edge attribute names:
##     weights

Now that we have generated a network object, we visualize the network.

ggnet2(net,
       label = TRUE,
       label.size = 4,
       alpha = 0.2,
       size.cut = 3,
       edge.alpha = 0.3) +
  guides(color = "none", size = "none")

The network is already informative, but we will customize the network object so that the visualization becomes more appealing and informative. To add information, we create vectors of words that represent different groups, e.g. terms that rarely, sometimes, or frequently collocate with selection (I used the dendrogram which displayed the cluster analysis as the basis for this categorization).

Based on these vectors, we can then change or adapt the default values of certain attributes or parameters of the network object (e.g. weights, line types, and colors).

# create vectors with collocation occurrences as categories
mid <- c("theory", "variations", "slight", "variation")
high <- c("natural", "selection")
infreq <- colnames(collocates_redux)[!colnames(collocates_redux) %in% mid & !colnames(collocates_redux) %in% high]
net %v% "Collocation" = ifelse(network.vertex.names(net) %in% infreq, "weak",
ifelse(network.vertex.names(net) %in% mid, "medium",
ifelse(network.vertex.names(net) %in% high, "strong", "other")))
# modify color
net %v% "color" = ifelse(net %v% "Collocation" == "weak", "gray60",
ifelse(net %v% "Collocation" == "medium", "orange",
ifelse(net %v% "Collocation" == "strong", "indianred4", "gray60")))
# rescale edge size
network::set.edge.attribute(net, "weights", ifelse(net %e% "weights" < 1, 0.1,
ifelse(net %e% "weights" <= 2, .5, 1)))
# define line type
network::set.edge.attribute(net, "lty", ifelse(net %e% "weights" <=.1, 3,
ifelse(net %e% "weights" <= .5, 2, 1)))

We can now display the network object and make use of the added information.

ggnet2(net,
       color = "color",
       label = TRUE,
       label.size = 4,
       alpha = 0.2,
       size = "degree",
       edge.size = "weights",
       edge.lty = "lty",
       edge.alpha = 0.2) +
  guides(color = "none", size = "none")

## Biplots

An alternative way to display co-occurrence patterns are bi-plots, which are used to display the results of Correspondence Analyses. They are useful in particular when one is not interested in one particular key term and its collocations but in the overall similarity of many terms. Semantic similarity in this case refers to a shared semantic, and thus distributional, profile: words can be deemed semantically similar if they have a similar co-occurrence profile, i.e. if they co-occur with the same elements. Biplots can be used to visualize collocations because collocates co-occur and thus share semantic properties, which renders them more similar to each other compared with other terms.

# perform correspondence analysis
res.ca <- CA(collocates_redux, graph = FALSE)
# plot results
fviz_ca_row(res.ca, repel = TRUE, col.row = "gray20")

The bi-plot shows that natural and selection collocate as they are plotted in close proximity. The advantage of the biplot becomes apparent when we focus on other terms, because the biplot also shows other collocates such as vary and independently or might and injurious.

# 4 Determining Significance

In order to identify which words occur together significantly more frequently than would be expected by chance, we have to determine if their co-occurrence frequency is statistically significant. This can be done either for specific key terms or for the entire data. In this example, we will continue to focus on the key word selection.

To determine which terms collocate significantly with the key term (selection), we use multiple (or repeated) Fisher's Exact tests, which require the following information (a single such test is sketched after the list):

• a = Number of times coocTerm occurs with term j

• b = Number of times coocTerm occurs without term j

• c = Number of times other terms occur with term j

• d = Number of terms that are not coocTerm or term j
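To illustrate what a single such test looks like, the following sketch runs one Fisher's Exact test on a 2x2 table of these four quantities. The counts used here correspond to the selection/species pair in the analysis below, so the resulting p-value should roughly match the one reported in the results table further down (about 0.0095).

# a single Fisher's Exact test on a 2x2 contingency table
# (counts correspond to the selection/species pair in the analysis below)
a <- 103      # selection co-occurs with species
b <- 9436     # selection occurs without species
c <- 23043    # species occurs without selection
d <- 1640120  # neither selection nor species
fisher.test(matrix(c(a, b, c, d), ncol = 2, byrow = TRUE))$p.value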

In a first step, we create a table which holds these quantities.

# convert to matrix and set the diagonal (a term co-occurring with itself) to 0
coocmx <- as.matrix(collocates)
diag(coocmx) <- 0
# convert to data frame
coocdf <- as.data.frame(coocmx)
# reduce data
coocdf <- coocdf[which(rowSums(coocdf) > 10),]
coocdf <- coocdf[, which(colSums(coocdf) > 10)]
# extract stats
cooctb <- coocdf %>%
dplyr::mutate(Term = rownames(coocdf)) %>%
tidyr::gather(CoocTerm, TermCoocFreq,
colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]) %>%
dplyr::mutate(Term = factor(Term),
CoocTerm = factor(CoocTerm)) %>%
dplyr::mutate(AllFreq = sum(TermCoocFreq)) %>%
dplyr::group_by(Term) %>%
dplyr::mutate(TermFreq = sum(TermCoocFreq)) %>%
dplyr::ungroup(Term) %>%
dplyr::group_by(CoocTerm) %>%
dplyr::mutate(CoocFreq = sum(TermCoocFreq)) %>%
dplyr::arrange(Term) %>%
dplyr::mutate(a = TermCoocFreq,
b = TermFreq - a,
c = CoocFreq - a,
d = AllFreq - (a + b + c)) %>%
dplyr::mutate(NRows = nrow(coocdf))
| Term | CoocTerm | TermCoocFreq | AllFreq | TermFreq | CoocFreq | a | b | c | d | NRows |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| abdomen | board | 0 | 1,672,702 | 71 | 38 | 0 | 71 | 38 | 1,672,593 | 8,117 |
| abdomen | charles | 0 | 1,672,702 | 71 | 161 | 0 | 71 | 161 | 1,672,470 | 8,117 |
| abdomen | darwin | 0 | 1,672,702 | 71 | 71 | 0 | 71 | 71 | 1,672,560 | 8,117 |
| abdomen | historical | 0 | 1,672,702 | 71 | 78 | 0 | 71 | 78 | 1,672,553 | 8,117 |
| abdomen | introduction | 0 | 1,672,702 | 71 | 158 | 0 | 71 | 158 | 1,672,473 | 8,117 |
| abdomen | opinion | 0 | 1,672,702 | 71 | 143 | 0 | 71 | 143 | 1,672,488 | 8,117 |
| abdomen | origin | 0 | 1,672,702 | 71 | 1,014 | 0 | 71 | 1,014 | 1,671,617 | 8,117 |
| abdomen | progress | 0 | 1,672,702 | 71 | 426 | 0 | 71 | 426 | 1,672,205 | 8,117 |
| abdomen | sketch | 0 | 1,672,702 | 71 | 57 | 0 | 71 | 57 | 1,672,574 | 8,117 |
| abdomen | species | 0 | 1,672,702 | 71 | 23,146 | 0 | 71 | 23,146 | 1,649,485 | 8,117 |
| abdomen | america | 0 | 1,672,702 | 71 | 2,263 | 0 | 71 | 2,263 | 1,670,368 | 8,117 |
| abdomen | beagle | 0 | 1,672,702 | 71 | 45 | 0 | 71 | 45 | 1,672,586 | 8,117 |
| abdomen | beings | 0 | 1,672,702 | 71 | 3,138 | 0 | 71 | 3,138 | 1,669,493 | 8,117 |
| abdomen | certain | 0 | 1,672,702 | 71 | 5,110 | 0 | 71 | 5,110 | 1,667,521 | 8,117 |
| abdomen | continent | 0 | 1,672,702 | 71 | 1,172 | 0 | 71 | 1,172 | 1,671,459 | 8,117 |

We now select the key term (selection). If we wanted to find all collocations that are present in the data, we would use the entire data rather than only the subset that contains selection.

cooctb_redux <- cooctb %>%
dplyr::filter(Term == coocTerm)

Next, we calculate which terms are (significantly) over- or under-proportionately used with selection. It is important to note that this procedure informs about both over- and under-use! This is especially crucial when analyzing whether specific words are attracted to or repelled by certain constructions. Of course, this approach is not restricted to analyses of constructions: it can easily be generalized across domains and has also been used in machine learning applications.

coocStatz <- cooctb_redux %>%
  dplyr::rowwise() %>%
  dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(a, b, c, d),
                                                        ncol = 2, byrow = TRUE))[1]))) %>%
  dplyr::mutate(x2 = as.vector(unlist(chisq.test(matrix(c(a, b, c, d),
                                                        ncol = 2, byrow = TRUE))[1]))) %>%
  dplyr::mutate(phi = sqrt((x2 / (a + b + c + d)))) %>%
  dplyr::mutate(expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d),
                                                              ncol = 2, byrow = TRUE))$expected[1]))) %>%
  dplyr::mutate(Significance = dplyr::case_when(p <= .001 ~ "p<.001",
                                                p <= .01 ~ "p<.01",
                                                p <= .05 ~ "p<.05",
                                                TRUE ~ "n.s."))

| Term | CoocTerm | TermCoocFreq | AllFreq | TermFreq | CoocFreq | a | b | c | d | NRows | p | x2 | phi | expected | Significance |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|:--|
| selection | board | 0 | 1,672,702 | 9,539 | 38 | 0 | 9,539 | 38 | 1,663,125 | 8,117 | 1.000000 | 5.91e-27 | 5.94e-17 | 0.217 | n.s. |
| selection | charles | 1 | 1,672,702 | 9,539 | 161 | 1 | 9,538 | 160 | 1,663,003 | 8,117 | 0.601806 | 2.81e-27 | 4.10e-17 | 0.918 | n.s. |
| selection | darwin | 2 | 1,672,702 | 9,539 | 71 | 2 | 9,537 | 69 | 1,663,094 | 8,117 | 0.062404 | 2.979 | 0.00133 | 0.405 | n.s. |
| selection | historical | 0 | 1,672,702 | 9,539 | 78 | 0 | 9,539 | 78 | 1,663,085 | 8,117 | 1.000000 | 8.25e-25 | 7.02e-16 | 0.445 | n.s. |
| selection | introduction | 1 | 1,672,702 | 9,539 | 158 | 1 | 9,538 | 157 | 1,663,006 | 8,117 | 0.594914 | 2.84e-27 | 4.12e-17 | 0.901 | n.s. |
| selection | opinion | 0 | 1,672,702 | 9,539 | 143 | 0 | 9,539 | 143 | 1,663,020 | 8,117 | 1.000000 | 0.123 | 0.00027 | 0.815 | n.s. |
| selection | origin | 6 | 1,672,702 | 9,539 | 1,014 | 6 | 9,533 | 1,008 | 1,662,155 | 8,117 | 0.833611 | 1.54e-21 | 3.03e-14 | 5.783 | n.s. |
| selection | progress | 4 | 1,672,702 | 9,539 | 426 | 4 | 9,535 | 422 | 1,662,741 | 8,117 | 0.314533 | 0.475 | 0.00053 | 2.429 | n.s. |
| selection | sketch | 0 | 1,672,702 | 9,539 | 57 | 0 | 9,539 | 57 | 1,663,106 | 8,117 | 1.000000 | 6.33e-28 | 1.95e-17 | 0.325 | n.s. |
| selection | species | 103 | 1,672,702 | 9,539 | 23,146 | 103 | 9,436 | 23,043 | 1,640,120 | 8,117 | 0.009463 | 6.274 | 0.00194 | 131.996 | p<.01 |
| selection | america | 0 | 1,672,702 | 9,539 | 2,263 | 0 | 9,539 | 2,263 | 1,660,900 | 8,117 | 0.000004 | 12.009 | 0.00268 | 12.905 | p<.001 |
| selection | beagle | 0 | 1,672,702 | 9,539 | 45 | 0 | 9,539 | 45 | 1,663,118 | 8,117 | 1.000000 | 1.55e-24 | 9.61e-16 | 0.257 | n.s. |
| selection | beings | 23 | 1,672,702 | 9,539 | 3,138 | 23 | 9,516 | 3,115 | 1,660,048 | 8,117 | 0.233002 | 1.194 | 0.00084 | 17.895 | n.s. |
| selection | certain | 25 | 1,672,702 | 9,539 | 5,110 | 25 | 9,514 | 5,085 | 1,658,078 | 8,117 | 0.514243 | 0.459 | 0.00052 | 29.141 | n.s. |
| selection | continent | 1 | 1,672,702 | 9,539 | 1,172 | 1 | 9,538 | 1,171 | 1,661,992 | 8,117 | 0.018144 | 4.046 | 0.00156 | 6.684 | p<.05 |

We now add information to the table and remove superfluous columns so that the table can be parsed more easily.
coocStatz <- coocStatz %>%
  dplyr::ungroup() %>%
  dplyr::arrange(p) %>%
  dplyr::mutate(j = 1:n()) %>%
  # perform benjamini-hochberg correction: each p-value is compared
  # against its rank-dependent threshold (rank / number of tests * alpha)
  dplyr::mutate(corr05 = ((j/NRows)*0.05)) %>%
  dplyr::mutate(corr01 = ((j/NRows)*0.01)) %>%
  dplyr::mutate(corr001 = ((j/NRows)*0.001)) %>%
  # calculate corrected significance status
  dplyr::mutate(CorrSignificance = dplyr::case_when(p <= corr001 ~ "p<.001",
                                                    p <= corr01 ~ "p<.01",
                                                    p <= corr05 ~ "p<.05",
                                                    TRUE ~ "n.s.")) %>%
  dplyr::mutate(p = round(p, 6)) %>%
  dplyr::mutate(x2 = round(x2, 1)) %>%
  dplyr::mutate(phi = round(phi, 2)) %>%
  dplyr::arrange(p) %>%
  dplyr::select(-a, -b, -c, -d, -j, -NRows, -corr05, -corr01, -corr001) %>%
  dplyr::mutate(Type = ifelse(expected > TermCoocFreq, "Antitype", "Type"))

| Term | CoocTerm | TermCoocFreq | AllFreq | TermFreq | CoocFreq | p | x2 | phi | expected | Significance | CorrSignificance | Type |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|--:|:--|:--|:--|
| selection | natural | 381 | 1,672,702 | 9,539 | 9,535 | 0 | 1,978.5 | 0.03 | 54.375714 | p<.001 | p<.001 | Type |
| selection | selection | 0 | 1,672,702 | 9,539 | 9,539 | 0 | 54.0 | 0.01 | 54.398525 | p<.001 | p<.001 | Antitype |
| selection | theory | 65 | 1,672,702 | 9,539 | 2,862 | 0 | 143.3 | 0.01 | 16.321268 | p<.001 | p<.001 | Type |
| selection | variations | 63 | 1,672,702 | 9,539 | 3,004 | 0 | 121.1 | 0.01 | 17.131059 | p<.001 | p<.001 | Type |
| selection | effects | 33 | 1,672,702 | 9,539 | 1,525 | 0 | 65.6 | 0.01 | 8.696693 | p<.001 | p<.001 | Type |
| selection | unconscious | 16 | 1,672,702 | 9,539 | 345 | 0 | 93.6 | 0.01 | 1.967448 | p<.001 | p<.001 | Type |
| selection | modifications | 32 | 1,672,702 | 9,539 | 1,595 | 0 | 55.6 | 0.01 | 9.095885 | p<.001 | p<.001 | Type |
| selection | acts | 19 | 1,672,702 | 9,539 | 615 | 0 | 64.5 | 0.01 | 3.507191 | p<.001 | p<.001 | Type |
| selection | power | 32 | 1,672,702 | 9,539 | 1,766 | 0 | 45.9 | 0.01 | 10.071055 | p<.001 | p<.001 | Type |
| selection | principle | 31 | 1,672,702 | 9,539 | 1,704 | 0 | 44.7 | 0.01 | 9.717485 | p<.001 | p<.001 | Type |
| selection | sexual | 21 | 1,672,702 | 9,539 | 878 | 0 | 48.2 | 0.01 | 5.007014 | p<.001 | p<.001 | Type |
| selection | disuse | 23 | 1,672,702 | 9,539 | 1,055 | 0 | 45.4 | 0.01 | 6.016400 | p<.001 | p<.001 | Type |
| selection | variation | 34 | 1,672,702 | 9,539 | 2,152 | 0 | 37.0 | 0.00 | 12.272316 | p<.001 | p<.001 | Type |
| selection | accumulated | 19 | 1,672,702 | 9,539 | 798 | 0 | 43.0 | 0.01 | 4.550794 | p<.001 | p<.001 | Type |
| selection | methodical | 10 | 1,672,702 | 9,539 | 206 | 0 | 59.3 | 0.01 | 1.174766 | p<.001 | p<.001 | Type |

The results show that selection collocates significantly with selection itself (of course) but also, as expected, with natural. The corrected p-values show which terms remain significant collocates of selection after the Benjamini-Hochberg correction for multiple/repeated testing. Corrections are necessary when performing multiple tests because otherwise, the reliability of the test results would be strongly impaired, as repeated testing causes substantive α-error inflation. The Benjamini-Hochberg correction used here is preferable over the more popular Bonferroni correction because it is less conservative and therefore less likely to result in β-errors (see again Field, Miles, and Field 2012).

# 5 Changes in Collocation Strength

We now turn to analyses of changes in collocation strength over apparent time. The example focuses on adjective amplification in Australian English. The issue we will analyze here is whether we can unearth changes in the collocation patterns of adjective amplifiers such as very, really, or so. In other words, we will investigate if amplifiers associate with different adjectives among speakers from different age groups. In a first step, we activate packages and load the data.
# load functions
source("https://SLCLADAL.github.io/rscripts/collexcovar.R")
# load data
ampaus <- base::readRDS(url("https://slcladal.github.io/data/ozd.rda", "rb"))

| Adjective | Variant | Age |
|:--|:--|:--|
| good | really | 26-40 |
| good | other | 26-40 |
| good | other | 26-40 |
| other | pretty | 26-40 |
| other | other | 26-40 |
| good | other | 26-40 |
| good | pretty | 26-40 |
| nice | very | 17-25 |
| other | so | 17-25 |
| other | so | 41-80 |
| other | very | 17-25 |
| good | other | 41-80 |
| good | other | 41-80 |
| other | so | 26-40 |
| other | really | 17-25 |

The data consists of three variables (Adjective, Variant, and Age). In a next step, we perform a co-varying collexeme analysis for really versus all other amplifiers. The collexcovar function takes a data set consisting of three columns labeled keys, colls, and time, so we rename the columns accordingly.

# rename data
ampaus <- ampaus %>%
  dplyr::rename(keys = Variant, colls = Adjective, time = Age)
# perform analysis
collexcovar_really <- collexcovar(data = ampaus, keyterm = "really")

| time | colls | Freq_key | Freq_other | Freq_Colls | p | x2 | phi | expected | CorrSignificance | Type | Variant |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|:--|:--|:--|
| 17-25 | other | 83 | 144 | 227 | 0.000035 | 17.8 | 0.21 | 103.9480198 | p<.001 | Antitype | really |
| 17-25 | good | 53 | 31 | 84 | 0.000521 | 12.8 | 0.18 | 38.4653465 | p<.01 | Type | really |
| 17-25 | nice | 22 | 17 | 39 | 0.178526 | 2.0 | 0.07 | 17.8589109 | n.s. | Type | really |
| 26-40 | other | 29 | 58 | 87 | 0.209406 | 2.0 | 0.13 | 32.3553719 | n.s. | Antitype | really |
| 41-80 | bad | 1 | 0 | 1 | 0.245614 | 3.1 | 0.23 | 0.2456140 | n.s. | Type | really |
| 41-80 | hard | 2 | 2 | 4 | 0.250186 | 1.5 | 0.16 | 0.9824561 | n.s. | Type | really |
| 26-40 | bad | 3 | 2 | 5 | 0.359479 | 1.2 | 0.10 | 1.8595041 | n.s. | Type | really |
| 26-40 | hard | 1 | 0 | 1 | 0.371901 | 1.7 | 0.12 | 0.3719008 | n.s. | Type | really |
| 17-25 | funny | 13 | 11 | 24 | 0.407421 | 0.7 | 0.04 | 10.9900990 | n.s. | Type | really |
| 41-80 | other | 7 | 26 | 33 | 0.544097 | 0.5 | 0.09 | 8.1052632 | n.s. | Antitype | really |
| 26-40 | good | 8 | 10 | 18 | 0.598505 | 0.5 | 0.06 | 6.6942149 | n.s. | Type | really |
| 17-25 | bad | 9 | 10 | 19 | 1.000000 | 0.0 | 0.01 | 8.7004950 | n.s. | Type | really |
| 17-25 | hard | 5 | 6 | 11 | 1.000000 | 0.0 | 0.00 | 5.0371287 | n.s. | Antitype | really |
| 26-40 | funny | 1 | 1 | 2 | 1.000000 | 0.1 | 0.03 | 0.7438017 | n.s. | Type | really |
| 26-40 | nice | 3 | 5 | 8 | 1.000000 | 0.0 | 0.00 | 2.9752066 | n.s. | Type | really |

Now that the data has the correct labels, we can continue with the implementation of the co-varying collexeme analysis for the remaining amplifiers.

# perform analysis
collexcovar_pretty <- collexcovar(data = ampaus, keyterm = "pretty")
collexcovar_so <- collexcovar(data = ampaus, keyterm = "so")
collexcovar_very <- collexcovar(data = ampaus, keyterm = "very")

For the remaining amplifiers, we have to change the label other to bin, as the function already uses a label other. Once we have changed other to bin, we perform the analysis.

ampaus <- ampaus %>%
  dplyr::mutate(keys = ifelse(keys == "other", "bin", keys))
collexcovar_other <- collexcovar(data = ampaus, keyterm = "bin")

Next, we combine the results of the co-varying collexeme analyses into a single table.

# combine tables
collexcovar_ampaus <- rbind(collexcovar_really, collexcovar_very,
                            collexcovar_so, collexcovar_pretty, collexcovar_other)
collexcovar_ampaus <- collexcovar_ampaus %>%
  dplyr::rename(Age = time, Adjective = colls) %>%
  dplyr::mutate(Variant = ifelse(Variant == "bin", "other", Variant)) %>%
  dplyr::arrange(Age)

| Age | Adjective | Freq_key | Freq_other | Freq_Colls | p | x2 | phi | expected | CorrSignificance | Type | Variant |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|:--|:--|:--|
| 17-25 | other | 83 | 144 | 227 | 0.000035 | 17.8 | 0.21 | 103.948020 | p<.001 | Antitype | really |
| 17-25 | good | 53 | 31 | 84 | 0.000521 | 12.8 | 0.18 | 38.465347 | p<.01 | Type | really |
| 17-25 | nice | 22 | 17 | 39 | 0.178526 | 2.0 | 0.07 | 17.858911 | n.s. | Type | really |
| 17-25 | funny | 13 | 11 | 24 | 0.407421 | 0.7 | 0.04 | 10.990099 | n.s. | Type | really |
| 17-25 | bad | 9 | 10 | 19 | 1.000000 | 0.0 | 0.01 | 8.700495 | n.s. | Type | really |
| 17-25 | hard | 5 | 6 | 11 | 1.000000 | 0.0 | 0.00 | 5.037129 | n.s. | Antitype | really |
| 17-25 | funny | 0 | 24 | 24 | 0.034280 | 4.5 | 0.10 | 3.564356 | n.s. | Antitype | very |
| 17-25 | other | 40 | 187 | 227 | 0.090627 | 3.1 | 0.09 | 33.712871 | n.s. | Type | very |
| 17-25 | bad | 0 | 19 | 19 | 0.090928 | 3.5 | 0.09 | 2.821782 | n.s. | Antitype | very |
| 17-25 | hard | 3 | 8 | 11 | 0.214720 | 1.4 | 0.06 | 1.633663 | n.s. | Type | very |
| 17-25 | good | 9 | 75 | 84 | 0.300778 | 1.4 | 0.06 | 12.475248 | n.s. | Antitype | very |
| 17-25 | nice | 8 | 31 | 39 | 0.340605 | 1.1 | 0.05 | 5.792079 | n.s. | Type | very |
| 17-25 | good | 7 | 77 | 84 | 0.000083 | 14.3 | 0.19 | 20.168317 | p<.01 | Antitype | so |
| 17-25 | other | 63 | 164 | 227 | 0.047107 | 4.0 | 0.10 | 54.502475 | n.s. | Type | so |
| 17-25 | funny | 10 | 14 | 24 | 0.047832 | 4.4 | 0.10 | 5.762376 | n.s. | Type | so |

We now modify the data set so that we can plot the collocation strength across apparent time.

ampauscoll <- collexcovar_ampaus %>%
  dplyr::select(Age, Adjective, Variant, Type, phi) %>%
  dplyr::mutate(phi = ifelse(Type == "Antitype", -phi, phi)) %>%
  dplyr::select(-Type) %>%
  tidyr::spread(Adjective, phi) %>%
  tidyr::replace_na(list(bad = 0, funny = 0, hard = 0,
                         good = 0, nice = 0, other = 0)) %>%
  tidyr::gather(Adjective, phi, bad:other) %>%
  tidyr::spread(Variant, phi) %>%
  tidyr::replace_na(list(pretty = 0, really = 0, so = 0,
                         very = 0, other = 0)) %>%
  tidyr::gather(Variant, phi, other:very)

| Age | Adjective | Variant | phi |
|:--|:--|:--|--:|
| 17-25 | bad | other | -0.05 |
| 17-25 | funny | other | -0.05 |
| 17-25 | good | other | -0.05 |
| 17-25 | hard | other | 0.04 |
| 17-25 | nice | other | -0.07 |
| 17-25 | other | other | 0.12 |
| 26-40 | bad | other | 0.02 |
| 26-40 | funny | other | -0.06 |
| 26-40 | good | other | 0.01 |
| 26-40 | hard | other | -0.04 |
| 26-40 | nice | other | -0.11 |
| 26-40 | other | other | 0.07 |
| 41-80 | bad | other | -0.05 |
| 41-80 | funny | other | -0.05 |
| 41-80 | good | other | 0.00 |

In a final step, we visualize the results of our analysis.

ggplot(ampauscoll, aes(x = reorder(Age, desc(Age)), y = phi,
                       group = Variant, color = Variant, linetype = Variant)) +
  facet_wrap(vars(Adjective)) +
  geom_line() +
  guides(color = guide_legend(override.aes = list(fill = NA))) +
  scale_color_manual(values = c("gray70", "gray70", "gray20", "gray70", "gray20"),
                     name = "Variant",
                     breaks = c("other", "pretty", "really", "so", "very"),
                     labels = c("other", "pretty", "really", "so", "very")) +
  scale_linetype_manual(values = c("dotted", "dotdash", "longdash", "dashed", "solid"),
                        name = "Variant",
                        breaks = c("other", "pretty", "really", "so", "very"),
                        labels = c("other", "pretty", "really", "so", "very")) +
  theme(legend.position = "top",
        axis.text.x = element_text(size = 12),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  theme_set(theme_bw(base_size = 12)) +
  coord_cartesian(ylim = c(-.2, .4)) +
  labs(x = "Age", y = "Collocation Strength") +
  guides(size = "none") +
  guides(alpha = "none")

The results show that the collocation strength of different amplifier variants changes quite notably across age groups, and we can also see that there is considerable variability in the way that the collocation strengths change. For example, the collocation strength between bad and really decreases from old to young speakers, while the reverse trend emerges for good, which means that really collocates more strongly with good among younger speakers than among older speakers.

# 6 Collostructional Analysis

Collostructional analysis investigates the lexicogrammatical associations between constructions and lexical elements. There are three basic subtypes of collostructional analysis:

• Simple Collexeme Analysis

• Distinctive Collexeme Analysis

• Co-Varying Collexeme Analysis

The analyses performed here are based on the collostructions package.
## Simple Collexeme Analysis

Simple Collexeme Analysis determines if a word is significantly attracted to a specific construction within a corpus. The idea is that the frequency of a word that is attracted to a construction is significantly higher within the construction than would be expected by chance. The example here analyzes the Go + Verb construction (e.g. Go suck a nut!). The question is which verbs are attracted to this construction (in this case, if suck is attracted to this construction).

To perform these analyses, we use the collostructions package. Information about how to download and install this package can be found here.

NOTE

Downloading and installing the collostructions package is a bit tricky and not really user friendly. To install this package, go to the website of Susanne Flach, who has written the collostructions package. On that website, you find different versions of the package for different operating systems (OS) like Windows and Mac. Next, download the version that is right for your OS, unzip the package file, and copy it into your R library. Once you have copied the collostructions package into your R library, you can run the code chunks below to install and activate all the required packages.

First, we install and activate the devtools package and install the collostructions package from the local file (adjust the path to where you copied the package).

# install devtools package
install.packages("devtools")
# load devtools package
library(devtools)
# install collostructions package
install_local(here::here("renv/library/R-4.2/x86_64-w64-mingw32/collostructions_0.2.0.zip"),
              repos = NULL, type = "source")

We can now, finally, load the collostructions package.

# load collostructions package
library(collostructions)

Next, we inspect the data. In this case, we will only use a sample of 100 rows from the data set, as the output would otherwise become hard to read.

# draw a sample of the data
goVerb <- goVerb[sample(nrow(goVerb), 100),]

| WORD | CXN.FREQ | CORP.FREQ |
|:--|--:|--:|
| attend | 3 | 28,220 |
| form | 2 | 162,783 |
| cash | 2 | 31,529 |
| endure | 1 | 4,927 |
| climb | 6 | 11,750 |
| recover | 2 | 10,377 |
| rake | 1 | 1,185 |
| stitch | 1 | 3,286 |
| polish | 2 | 4,086 |
| train | 10 | 38,282 |
| freeze | 1 | 5,823 |
| transfer | 1 | 31,353 |
| defend | 1 | 16,950 |
| refresh | 2 | 2,871 |
| support | 10 | 237,775 |

The collex function, which calculates the results of a simple collexeme analysis, requires a data frame consisting of three columns that contain in column 1 the word to be tested, in column 2 the frequency of the word in the construction (CXN.FREQ), and in column 3 the frequency of the word in the corpus (CORP.FREQ). To perform the simple collexeme analysis, we need the overall size of the corpus, the frequency with which a word occurs in the construction under investigation, and the frequency of that construction.

# define corpus size
crpsiz <- sum(goVerb$CORP.FREQ)
# perform simple collexeme analysis
scollex_results <- collex(goVerb, corpsize = crpsiz, am = "logl",
reverse = FALSE, decimals = 5,
threshold = 1, cxn.freq = NULL,
str.dir = FALSE)
| COLLEX | CORP.FREQ | OBS | EXP | ASSOC | COLL.STR.LOGL | SIGNIF |
|:--|--:|--:|--:|:--|--:|:--|
| check | 84,138 | 440 | 30.1 | attr | 1,687.46661 | ***** |
| grab | 10,759 | 76 | 3.8 | attr | 313.77122 | ***** |
| fetch | 2,457 | 34 | 0.9 | attr | 183.69957 | ***** |
| hang | 13,430 | 54 | 4.8 | attr | 165.07365 | ***** |
| vote | 50,927 | 63 | 18.2 | attr | 68.47607 | ***** |
| hunt | 12,921 | 25 | 4.6 | attr | 44.05167 | ***** |
| tell | 133,788 | 99 | 47.8 | attr | 43.86189 | ***** |
| hug | 2,814 | 12 | 1.0 | attr | 37.65944 | ***** |
| screw | 5,924 | 13 | 2.1 | attr | 25.54255 | ***** |
| defecate | 134 | 3 | 0.0 | attr | 18.99614 | **** |
| indoctrinate | 181 | 2 | 0.1 | attr | 9.88063 | ** |
| sulk | 200 | 2 | 0.1 | attr | 9.49279 | ** |
| vandalise | 30 | 1 | 0.0 | attr | 7.12735 | ** |
| investigate | 13,959 | 12 | 5.0 | attr | 7.09149 | ** |
| terrorise | 116 | 1 | 0.0 | attr | 4.45890 | * |

The results show which words are significantly attracted to the construction. If the ASSOC column did not show attr, then the word would be repelled by the construction.

## Covarying Collexeme Analysis

Covarying collexeme analysis determines if the occurrence of a word in the first slot of a construction correlates with the occurrence of a word in the second slot of that construction. As such, covarying collexeme analysis analyzes constructions with two slots and how the lexical elements within the two slots affect each other.

The data we will use is called vsmdata and consists of 5,000 observations of adjectives, recording whether each adjective is amplified. As such, vsmdata contains two columns: one column with the adjectives (Adjective) and another column indicating whether the adjective has been amplified (0 means that the adjective occurred without an amplifier). The first six rows of the data are shown below.

# load data (the URL is an assumption - adjust the path to where you store the vsmdata file)
vsmdata <- base::readRDS(url("https://slcladal.github.io/data/vsmdata.rda", "rb")) %>%
  dplyr::mutate(Amplifier = ifelse(Amplifier == 0, 0, 1))
| Amplifier | Adjective |
|--:|:--|
| 0 | serious |
| 0 | sure |
| 1 | many |
| 0 | many |
| 0 | good |
| 0 | much |

We now perform the collexeme analysis and inspect the results.

covar_results <- collex.covar(vsmdata)
| SLOT1 | SLOT2 | fS1 | fS2 | OBS | EXP | ASSOC | COLL.STR.LOGL | SIGNIF |
|:--|:--|--:|--:|--:|--:|:--|--:|:--|
| 0 | last | 4,397 | 308 | 307 | 270.9 | attr | 72.26078 | ***** |
| 1 | difficult | 603 | 54 | 26 | 6.5 | attr | 43.13631 | ***** |
| 0 | little | 4,397 | 309 | 301 | 271.7 | attr | 38.65379 | ***** |
| 0 | great | 4,397 | 179 | 177 | 157.4 | attr | 32.74427 | ***** |
| 0 | same | 4,397 | 171 | 169 | 150.4 | attr | 30.79976 | ***** |
| 0 | new | 4,397 | 182 | 179 | 160.1 | attr | 28.81377 | ***** |
| 1 | nice | 603 | 95 | 29 | 11.5 | attr | 23.34775 | ***** |
| 0 | old | 4,397 | 132 | 130 | 116.1 | attr | 21.51969 | ***** |
| 1 | good | 603 | 385 | 76 | 46.4 | attr | 20.23791 | ***** |
| 0 | second | 4,397 | 103 | 102 | 90.6 | attr | 19.43779 | **** |

The results show whether words in the first and second slot attract or repel each other (ASSOC) and provide uncorrected significance levels (SIGNIF).

## Distinctive Collexeme Analysis

Distinctive Collexeme Analysis determines if the frequencies of items in two alternating constructions or under two conditions differ significantly. This analysis can be extended to analyze if the use of a word differs between two corpora.

Again, we use the vsmdata data.

collexdist_results <- collex.dist(vsmdata, raw = TRUE)
| COLLEX | O.CXN1 | E.CXN1 | O.CXN2 | E.CXN2 | ASSOC | COLL.STR.LOGL | SIGNIF | SHARED |
|:--|--:|--:|--:|--:|:--|--:|:--|:--|
| last | 307 | 270.9 | 1 | 37.1 | 0 | 72.26078 | ***** | Y |
| little | 301 | 271.7 | 8 | 37.3 | 0 | 38.65379 | ***** | Y |
| great | 177 | 157.4 | 2 | 21.6 | 0 | 32.74427 | ***** | Y |
| same | 169 | 150.4 | 2 | 20.6 | 0 | 30.79976 | ***** | Y |
| new | 179 | 160.1 | 3 | 21.9 | 0 | 28.81377 | ***** | Y |
| old | 130 | 116.1 | 2 | 15.9 | 0 | 21.51969 | ***** | Y |
| second | 102 | 90.6 | 1 | 12.4 | 0 | 19.43779 | **** | Y |
| sure | 128 | 116.1 | 4 | 15.9 | 0 | 14.24614 | *** | Y |
| open | 45 | 39.6 | 0 | 5.4 | 0 | 11.62229 | *** | N |
| real | 92 | 83.5 | 3 | 11.5 | 0 | 9.83958 | ** | Y |

The results show if words are significantly attracted or repelled by a modifier variant.

# Citation & Session Info

Schweinberger, Martin. 2022. Analyzing Co-Occurrences and Collocations in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/coll.html (Version 2022.11.18).

@manual{schweinberger2022coll,
author = {Schweinberger, Martin},
title = {Analyzing Co-Occurrences and Collocations in R},
year = {2022},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
edition = {2022.11.18}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] collostructions_0.2.0 slam_0.1-50           tm_0.7-8
##  [4] NLP_0.2-1             stringr_1.4.1         dplyr_1.0.10
##  [7] quanteda_3.2.2        Matrix_1.5-1          network_1.17.2
## [10] igraph_1.3.4          ggdendro_0.1.23       GGally_2.1.2
## [13] flextable_0.8.2       factoextra_1.0.7      ggplot2_3.3.6
## [16] FactoMineR_2.4
##
## loaded via a namespace (and not attached):
##  [1] RColorBrewer_1.1-3      SnowballC_0.7.0         backports_1.4.1
##  [4] tools_4.2.1             bslib_0.4.0             utf8_1.2.2
##  [7] R6_2.5.1                DT_0.26                 DBI_1.1.3
## [10] colorspace_2.0-3        withr_2.5.0             tidyselect_1.1.2
## [13] compiler_4.2.1          cli_3.3.0               flashClust_1.01-2
## [16] xml2_1.3.3              officer_0.4.4           labeling_0.4.2
## [19] sass_0.4.2              scales_1.2.1            systemfonts_1.0.4
## [22] digest_0.6.29           rmarkdown_2.16          base64enc_0.1-3
## [25] pkgconfig_2.0.3         htmltools_0.5.3         fastmap_1.1.0
## [28] highr_0.9               htmlwidgets_1.5.4       rlang_1.0.4
## [31] rstudioapi_0.14         jquerylib_0.1.4         generics_0.1.3
## [34] farver_2.1.1            jsonlite_1.8.0          statnet.common_4.6.0
## [37] car_3.1-0               zip_2.2.0               tokenizers_0.2.1
## [40] magrittr_2.0.3          leaps_3.1               Rcpp_1.0.9
## [43] munsell_0.5.0           fansi_1.0.3             abind_1.4-5
## [46] gdtools_0.2.4           lifecycle_1.0.1         scatterplot3d_0.3-41
## [49] stringi_1.7.8           yaml_2.3.5              carData_3.0-5
## [52] MASS_7.3-57             plyr_1.8.7              grid_4.2.1
## [55] parallel_4.2.1          ggrepel_0.9.1           crayon_1.5.1
## [58] lattice_0.20-45         quanteda.textstats_0.95 sna_2.7
## [61] knitr_1.40              klippy_0.0.0.9500       pillar_1.8.1
## [64] ggpubr_0.4.0            uuid_1.1-0              ggsignif_0.6.3
## [67] stopwords_2.3           fastmatch_1.1-3         glue_1.6.2
## [70] evaluate_0.16           tidytext_0.3.4          data.table_1.14.2
## [73] RcppParallel_5.1.5      vctrs_0.4.1             tidyr_1.2.0
## [76] gtable_0.3.0            purrr_0.3.4             reshape_0.8.9
## [79] assertthat_0.2.1        cachem_1.0.6            xfun_0.32
## [82] broom_1.0.0             rstatix_0.7.0           coda_0.19-4
## [85] janeaustenr_1.0.0       tibble_3.1.8            nsyllable_1.0.1
## [88] cluster_2.1.4           ellipsis_0.3.2