Introduction

This tutorial introduces collocation and co-occurrence analysis with R and shows how to extract and visualize semantic links between words.

This tutorial is aimed at beginners and intermediate users of R and showcases how to extract and analyze collocations and N-grams from textual data using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with collocation analysis.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knit the document to HTML or PDF, you need to make sure that you have R and RStudio installed, and you also need to download the bibliography file and store it in the same folder as the Rmd file.


Parts of this tutorial build on and use materials from this tutorial on co-occurrence analysis with R by Andreas Niekler and Gregor Wiedemann (see Wiedemann and Niekler 2017).


How can you determine if words occur more frequently together than would be expected by chance?


This tutorial aims to show how you can answer this question.

So, how would you find words that are associated with a specific term and how can you visualize such word nets? This tutorial focuses on co-occurrence and collocations of words. Collocations are words that occur very frequently together. For example, Merry Christmas is a collocation because merry and Christmas occur more frequently together than would be expected by chance. This means that if you were to shuffle all words in a corpus and then counted how often merry and Christmas co-occurred, they would co-occur significantly less often in the shuffled or randomized corpus than in a corpus that contains non-shuffled natural speech.
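The shuffling logic described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy word sequence is made up and is not part of the tutorial data, and count_bigram is a hypothetical helper defined here.

```r
# toy "corpus": a made-up word sequence repeated 20 times (illustration only)
toy <- rep(c("we", "wish", "you", "a", "merry", "christmas",
             "and", "a", "happy", "new", "year"), times = 20)

# hypothetical helper: count how often w1 is immediately followed by w2
count_bigram <- function(words, w1, w2) {
  sum(head(words, -1) == w1 & tail(words, -1) == w2)
}

# observed frequency in the original word order
observed <- count_bigram(toy, "merry", "christmas")

# frequencies after shuffling the word order 1,000 times
set.seed(123)
shuffled <- replicate(1000, count_bigram(sample(toy), "merry", "christmas"))

# compare the observed count with the average count in the shuffled corpora
observed
mean(shuffled)
# share of shuffled corpora with at least as many co-occurrences as observed
mean(shuffled >= observed)
```

If the observed count is far above what the shuffled corpora produce, the two words co-occur more often than chance would predict, which is the core idea behind the association measures used later in this tutorial.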

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below are executed without errors. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code. Installing all of the packages may take between 1 and 5 minutes, so you do not need to worry if it takes some time.

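# set options
options(stringsAsFactors = FALSE)
options(scipen = 999)
options(max.print=1000)
# install packages
install.packages("FactoMineR")
install.packages("factoextra")
install.packages("flextable")
install.packages("GGally")
install.packages("ggdendro")
install.packages("igraph")
install.packages("network")
install.packages("Matrix")
install.packages("quanteda")
install.packages("dplyr")
install.packages("stringr")
install.packages("tm")
# packages that are called via :: or used for plotting further below
install.packages("ggplot2")
install.packages("tidyr")
install.packages("tidytext")
install.packages("tokenizers")
install.packages("quanteda.textstats")
install.packages("here")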
install.packages(
    "https://sfla.ch/wp-content/uploads/2021/02/collostructions_0.2.0.tar.gz",
    repos=NULL,
    type="source"
)
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Next, we load the packages.

# load packages
library(FactoMineR)
library(factoextra)
library(flextable)
library(GGally)
library(ggdendro)
library(igraph)
library(network)
library(Matrix)
library(quanteda)
library(dplyr)
library(stringr)
library(tm)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

1 Extracting N-Grams and Collocations

Collocations are terms that co-occur (significantly) more often than would be expected by chance. A typical example of a collocation is Merry Christmas because the words merry and Christmas occur together more frequently than would be expected if words were just randomly strung together.

N-grams are related to collocates in that they represent words that occur together (bi-grams are two words that occur together, tri-grams three words, and so on). Fortunately, creating N-gram lists is very easy. We will use Charles Darwin’s On the Origin of Species by Means of Natural Selection as a data source and begin by generating a bi-gram list. As a first step, we load the data and split it into individual words.

# read in text
darwin <- base::readRDS(url("https://slcladal.github.io/data/cdo.rda", "rb")) %>%
  paste0(collapse = " ") %>%
  stringr::str_squish() %>%
  stringr::str_remove_all("- ")
# further processing
darwin_split <- darwin %>% 
  as_tibble() %>%
  tidytext::unnest_tokens(words, value)

We can create bi-grams (N-grams consisting of two elements) by pasting every word together with the word that immediately follows it. To do this, we could use a function that is already available but we can also very simply do this manually. The first step in creating bigrams manually consists in creating a table which holds every word in a corpus and the word that immediately follows it.

# create data frame
darwin_words <- darwin_split %>%
  dplyr::rename(word1 = words) %>%
  dplyr::mutate(word2 = c(word1[2:length(word1)], NA)) %>%
  na.omit()

We can then paste the elements in these two columns together and also inspect the frequency of each bigram.

darwin2grams <- darwin_words %>%
  dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
  dplyr::group_by(bigram) %>%
  dplyr::summarise(frequency = n()) %>%
  dplyr::arrange(-frequency)

The table shows that among the 15 most frequent bigrams, there is only a single bi-gram (natural selection) which does not involve stop words (words that have relational rather than referential meaning).

We can remove bigrams that include stop words rather easily as shown below.

# define stopwords (with a word boundary on both sides of every stop word)
stps <- paste0("\\b(", paste0(tm::stopwords(kind = "en"), collapse = "|"), ")\\b")
# clean bigram table
darwin2grams_clean <- darwin2grams %>%
  dplyr::filter(!str_detect(bigram, stps))
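As mentioned above, bigrams can also be generated with an existing function rather than manually. The following is a minimal sketch, assuming the tidytext package is installed; the toy sentence is made up for illustration and is not part of the Darwin data.

```r
library(dplyr)
library(tidytext)

# toy text (made up for illustration)
toy <- dplyr::tibble(value = "natural selection acts by natural selection alone")

# split the text into overlapping two-word sequences and count them
toy_bigrams <- toy %>%
  tidytext::unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
  dplyr::count(bigram, sort = TRUE)

# inspect the bigram frequencies
toy_bigrams
```

Changing n to 3 would produce tri-grams in the same way.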

Extracting N-Grams with quanteda

The quanteda package (see Benoit et al. 2018) offers excellent and very fast functions for extracting bigrams.

#clean corpus
darwin_clean <- darwin %>%
  stringr::str_to_title()
# tokenize corpus
darwin_tokzd <- quanteda::tokens(darwin_clean)
# extract bigrams
BiGrams <- darwin_tokzd %>% 
       quanteda::tokens_remove(stopwords("en")) %>% 
       quanteda::tokens_select(pattern = "^[A-Z]", 
                               valuetype = "regex",
                               case_insensitive = FALSE, 
                               padding = TRUE) %>% 
       quanteda.textstats::textstat_collocations(min_count = 5, tolower = FALSE)

We can also join the extracted bigrams into single tokens using the tokens_compound function, which replaces each matched two-word sequence in the tokenized text with one compound token.

ngram_extract <- quanteda::tokens_compound(darwin_tokzd, pattern = BiGrams)

We can now generate concordances (and clean the resulting kwic table - the keyword-in-context table).
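A minimal sketch of such a concordance is given here, assuming the kwic function from the quanteda package. The two-sentence text is made up for illustration (it is not the Darwin data), and the window size of 3 is an arbitrary choice.

```r
library(quanteda)

# toy text (made up for illustration)
toy_toks <- quanteda::tokens(
  "Natural selection acts slowly. I call this principle natural selection.",
  remove_punct = TRUE)

# keyword-in-context display for a two-word expression
toy_kwic <- quanteda::kwic(toy_toks,
                           pattern = quanteda::phrase("natural selection"),
                           window = 3)

# convert into a data frame that can be cleaned like any other table
toy_kwic <- as.data.frame(toy_kwic)
toy_kwic
```

Each row represents one hit together with the words that precede and follow it, so the table can be filtered or tidied with the usual dplyr functions.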

The disadvantage here is that, strictly speaking, we are only extracting N-grams but not collocates, as collocates do not necessarily have to occur in direct adjacency. The following section shows how to expand the extraction of N-grams to the extraction of collocates.

2 Finding Collocations

N-grams and collocations are not only important concepts in language teaching but are also fundamental in Text Analysis and many other research areas working with language data. Unfortunately, words that collocate do not have to be immediately adjacent; they can also be separated by several slots. This makes the retrieval of collocates substantially more difficult compared with a situation in which we only need to extract words that occur right next to each other.

In the following, we will extract collocations from Charles Darwin’s On the Origin of Species by Means of Natural Selection. In a first step, we will split the Origin into individual sentences.

# read in and process text
darwinsentences <- darwin %>%
  stringr::str_squish() %>%
  tokenizers::tokenize_sentences(.) %>%
  unlist() %>%
  stringr::str_remove_all("- ") %>%
  stringr::str_replace_all("\\W", " ") %>%
  stringr::str_squish()
# inspect data
head(darwinsentences)
## [1] "THE ORIGIN OF SPECIES BY CHARLES DARWIN AN HISTORICAL SKETCH OF THE PROGRESS OF OPINION ON THE ORIGIN OF SPECIES INTRODUCTION When on board H M S"                                                                                                                     
## [2] "Beagle as naturalist I was much struck with certain facts in the distribution of the organic beings inhabiting South America and in the geological relations of the present to the past inhabitants of that continent"                                                 
## [3] "These facts as will be seen in the latter chapters of this volume seemed to throw some light on the origin of species that mystery of mysteries as it has been called by one of our greatest philosophers"                                                             
## [4] "On my return home it occurred to me in 1837 that something might perhaps be made out on this question by patiently accumulating and reflecting on all sorts of facts which could possibly have any bearing on it"                                                      
## [5] "After five years work I allowed myself to speculate on the subject and drew up some short notes these I enlarged in 1844 into a sketch of the conclusions which then seemed to me probable from that period to the present day I have steadily pursued the same object"
## [6] "I hope that I may be excused for entering on these personal details as I give them to show that I have not been hasty in coming to a decision"

The first element does not represent a full sentence: it contains the title and headings, and it breaks off after H M S, apparently because the sentence tokenizer treated the period in the abbreviation H.M.S. as a sentence boundary. In a next step, we will create a matrix that shows how often each word co-occurred with each other word in the data.

# convert into corpus
darwincorpus <- Corpus(VectorSource(darwinsentences))
# create vector with words to remove
extrawords <- c("the", "can", "get", "got", "one", 
                "dont", "even", "may", "but", "will", 
                "much", "first", "see", "new", 
                "many", "less", "now", "well", "like", 
                "often", "every", "said", "two")
# clean corpus
darwincorpusclean <- darwincorpus %>%
  tm::tm_map(removePunctuation) %>%
  tm::tm_map(removeNumbers) %>%
  tm::tm_map(tm::content_transformer(tolower)) %>%
  tm::tm_map(removeWords, stopwords()) %>%
  tm::tm_map(removeWords, extrawords)
# create document term matrix
darwindtm <- DocumentTermMatrix(darwincorpusclean, control=list(bounds = list(global=c(1, Inf)), weighting = weightBin))

# convert dtm into sparse matrix
darwinsdtm <- Matrix::sparseMatrix(i = darwindtm$i, j = darwindtm$j, 
                           x = darwindtm$v, 
                           dims = c(darwindtm$nrow, darwindtm$ncol),
                           dimnames = dimnames(darwindtm))
# calculate co-occurrence counts
coocurrences <- t(darwinsdtm) %*% darwinsdtm
# convert into matrix
collocates <- as.matrix(coocurrences)
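To see why multiplying the transposed document-term matrix with itself yields co-occurrence counts, consider the following small sketch in base R. The three "sentences" and three terms are made up for illustration; a 1 means that the term occurs in the sentence.

```r
# toy binary document-term matrix: rows are sentences, columns are terms
toy_dtm <- matrix(c(1, 1, 0,
                    1, 1, 1,
                    0, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("s", 1:3),
                                  c("natural", "selection", "species")))

# each cell counts the sentences in which both terms occur
toy_cooc <- t(toy_dtm) %*% toy_dtm
toy_cooc
```

The off-diagonal cells hold the number of sentences shared by two terms (natural and selection share two sentences here), while the diagonal holds the number of sentences in which each term occurs at all.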

We can inspect this co-occurrence matrix and check how many terms (words or elements) it represents using the ncol function from base R. We can also check how often terms co-occur with other terms in the data by applying the summary function from base R to the row sums of the matrix. The output of the summary function tells us that the minimum value is 1 with a maximum of 24,489. The difference between the median (44.0) and the mean (204.3) indicates that the frequencies are distributed very non-normally - which is common for language data.

# inspect size of matrix
ncol(collocates)
## [1] 8638
summary(rowSums(collocates))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    22.0    44.0   204.3   137.0 24489.0

The ncol function reports that the data represents 8,638 words, and the summary shows that the highest row sum (the term with the most co-occurrences) is 24,489.

3 Visualizing Collocations

We will now use an example of one individual word (selection) to show how collocation strength for individual terms is calculated and how it can be visualized. The function calculateCoocStatistics is taken from this tutorial (see also Wiedemann and Niekler 2017).

# load function for co-occurrence calculation
source("https://slcladal.github.io/rscripts/calculateCoocStatistics.R")
# define term
coocTerm <- "selection"
# calculate co-occurrence statistics
coocs <- calculateCoocStatistics(coocTerm, darwinsdtm, measure="LOGLIK")
# inspect results
coocs[1:20]
##       natural        theory    variations       effects modifications 
##    1766.44751     127.86330     124.94947      69.94048      63.52192 
##          acts         power        slight     advantage        disuse 
##      53.15165      53.14602      48.94158      48.21207      47.84167 
##   accumulated        sexual     variation     principle        useful 
##      46.99429      44.31103      43.64372      42.17271      39.36931 
##  preservation      survival     structure        action    favourable 
##      37.35337      36.59873      36.37859      35.18738      35.06886

The output shows that the word most strongly associated with selection in Charles Darwin’s Origin is, unsurprisingly, natural. Given the substantial strength of the association between natural and selection, these terms are definitely collocates and almost - if not already - a lexicalized construction (at least in this text).
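To give an idea of what a log-likelihood association score expresses, the sketch below computes the generic log-likelihood ratio (G2) for a 2x2 table of observed frequencies in base R. Note the assumptions: g2 is a hypothetical helper written for this illustration, the frequencies are made up, and the exact formula used inside calculateCoocStatistics may differ from this textbook version.

```r
# hypothetical helper: log-likelihood ratio (G2) for a 2x2 table
# a = both terms together, b = first term only,
# c = second term only,    d = neither term
g2 <- function(a, b, c, d) {
  obs <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)
  # expected frequencies if the two terms were independent
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  # cells with an observed frequency of 0 contribute nothing
  2 * sum(ifelse(obs > 0, obs * log(obs / expd), 0))
}

# no association: observed frequencies equal expected frequencies
g2(10, 10, 10, 10)
# strong association: the terms co-occur far more often than expected
g2(30, 10, 10, 50)
```

The score is 0 when the observed frequencies match what independence predicts and grows as the observed co-occurrence frequency departs from the expected one.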

There are various visualization options for collocations. Which visualization method is appropriate depends on what the visualization should display.

Association Strength

We start with the most basic and visualize the collocation strength using a simple dot chart. We use the vector of association strengths generated above and transform it into a table. Also, we exclude elements with an association strength lower than 30.

coocdf <- coocs %>%
  as.data.frame() %>%
  dplyr::mutate(CollStrength = coocs,
                Term = names(coocs)) %>%
  dplyr::filter(CollStrength > 30)

We can now visualize the association strengths as shown in the code chunk below.

ggplot(coocdf, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
  geom_point() +
  coord_flip() +
  theme_bw() +
  labs(y = "")

The dot chart shows that natural is collocating more strongly with selection compared to any other term. This confirms that natural and selection form a collocation in Darwin’s Origin.

Dendrograms

Another method for visualizing collocations is the dendrogram. Dendrograms (also called tree-diagrams) show how similar elements are based on one or many features. As such, dendrograms are used to indicate groupings as they show elements (words) that are notably similar or different with respect to their association strength. To use this method, we first need to generate a distance matrix from our co-occurrence matrix.

coolocs <- c(coocdf$Term, "selection")
# remove non-collocating terms
collocates_redux <- collocates[rownames(collocates) %in% coolocs, ]
collocates_redux <- collocates_redux[, colnames(collocates_redux) %in% coolocs]
# create distance matrix
distmtx <- dist(collocates_redux)

clustertexts <- hclust(    # hierarchical cluster object
  distmtx,                 # use distance matrix as data
  method="ward.D2")        # ward.D2 as linkage method

ggdendrogram(clustertexts) +
  ggtitle("Terms strongly collocating with *selection*")
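Groupings can also be read off a cluster solution programmatically rather than by eye. The following base R sketch assumes a made-up 4x4 co-occurrence matrix (not the Darwin data) and uses cutree to assign each term to one of two clusters; the choice of two clusters is arbitrary.

```r
# toy co-occurrence matrix (made up for illustration)
toy <- matrix(c(10,  9,  1,  0,
                 9, 10,  0,  1,
                 1,  0, 10,  9,
                 0,  1,  9, 10),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d"),
                              c("a", "b", "c", "d")))

# distance matrix and hierarchical clustering with Ward's method
toy_clust <- hclust(dist(toy), method = "ward.D2")

# cut the tree into two groups and inspect the group membership
toy_groups <- cutree(toy_clust, k = 2)
toy_groups
```

Terms with similar co-occurrence profiles (a and b, c and d) end up in the same group, which is the same information the dendrogram conveys visually.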

Network Graphs

Network graphs are a very useful tool to show relationships (or the absence of relationships) between elements. Network graphs are highly useful when it comes to displaying the relationships that words have among each other and which properties these networks of words have.

Basic Network Graphs

In order to display a network, we need to create a network graph by using the network function from the network package.

net = network::network(collocates_redux, 
                       directed = FALSE,
                       ignore.eval = FALSE,
                       names.eval = "weights")
# vertex names
network.vertex.names(net) = rownames(collocates_redux)
# inspect object
net
##  Network attributes:
##   vertices = 26 
##   directed = FALSE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 265 
##     missing edges= 0 
##     non-missing edges= 265 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
##  Edge attribute names: 
##     weights

Now that we have generated a network object, we visualize the network.

ggnet2(net, 
       label = TRUE, 
       label.size = 4,
       alpha = 0.2,
       size.cut = 3,
       edge.alpha = 0.3) +
  guides(color = "none", size = "none")

The network is already informative but we will customize the network object so that the visualization becomes more appealing and informative. To add information, we create vectors of words that represent different groups, e.g. terms that rarely, sometimes, and frequently collocate with selection (I used the dendrogram which displayed the cluster analysis as the basis for the categorization).

Based on these vectors, we can then change or adapt the default values of certain attributes or parameters of the network object (e.g. weights, linetypes, and colors).

# create vectors with collocation occurrences as categories
mid <- c("theory", "variations", "slight", "variation")
high <- c("natural", "selection")
infreq <- colnames(collocates_redux)[!colnames(collocates_redux) %in% mid & !colnames(collocates_redux) %in% high]
# add color by group
net %v% "Collocation" = ifelse(network.vertex.names(net) %in% infreq, "weak", 
                   ifelse(network.vertex.names(net) %in% mid, "medium", 
                   ifelse(network.vertex.names(net) %in% high, "strong", "other")))
# modify color
net %v% "color" = ifelse(net %v% "Collocation" == "weak", "gray60", 
                  ifelse(net %v% "Collocation" == "medium", "orange", 
                  ifelse(net %v% "Collocation" == "strong", "indianred4", "gray60")))
# rescale edge size
network::set.edge.attribute(net, "weights", ifelse(net %e% "weights" < 1, 0.1, 
                                   ifelse(net %e% "weights" <= 2, .5, 1)))
# define line type
network::set.edge.attribute(net, "lty", ifelse(net %e% "weights" <=.1, 3, 
                               ifelse(net %e% "weights" <= .5, 2, 1)))

We can now display the network object and make use of the added information.

ggnet2(net, 
       color = "color", 
       label = TRUE, 
       label.size = 4,
       alpha = 0.2,
       size = "degree",
       edge.size = "weights",
       edge.lty = "lty",
       edge.alpha = 0.2) +
  guides(color = "none", size = "none")

Biplots

An alternative way to display co-occurrence patterns is the bi-plot, which is used to display the results of Correspondence Analyses. Bi-plots are useful, in particular, when one is not interested in one particular key term and its collocations but in the overall similarity of many terms. Semantic similarity in this case refers to a shared semantic and thus distributional profile. As such, words can be deemed semantically similar if they have a similar co-occurrence profile - i.e. they co-occur with the same elements. Biplots can be used to visualize collocations because collocates co-occur and thus share semantic properties, which renders them more similar to each other compared with other terms.

# perform correspondence analysis
res.ca <- CA(collocates_redux, graph = FALSE)
# plot results
fviz_ca_row(res.ca, repel = TRUE, col.row = "gray20")

The bi-plot shows that natural and selection collocate as they are plotted in close proximity. The advantage of the biplot becomes apparent when we focus on other terms because the biplot also shows other collocates such as vary and independently or might and injurious.

4 Determining Significance

In order to identify which words occur together significantly more frequently than would be expected by chance, we have to determine if their co-occurrence frequency is statistically significant. This can be done either for specific key terms or for the entire data. In this example, we will continue to focus on the key word selection.

To determine which terms collocate significantly with the key term (selection), we use multiple (or repeated) Fisher’s Exact tests which require the following information:

  • a = Number of times coocTerm occurs with term j

  • b = Number of times coocTerm occurs without term j

  • c = Number of times other terms occur with term j

  • d = Number of co-occurrences that involve neither coocTerm nor term j
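To make the four quantities more tangible, here is a minimal sketch of a single Fisher's Exact test in base R. The frequencies are made up for illustration and do not come from the Darwin data.

```r
# made-up frequencies for one key term and one candidate term j
a <- 30    # key term occurs with term j
b <- 70    # key term occurs without term j
c <- 20    # other terms occur with term j
d <- 880   # neither the key term nor term j is involved

# arrange the four quantities in a 2x2 table
toy_tab <- matrix(c(a, b, c, d), ncol = 2, byrow = TRUE,
                  dimnames = list(c("key term", "other terms"),
                                  c("with j", "without j")))

# Fisher's Exact test of the association between key term and term j
toy_fisher <- fisher.test(toy_tab)
toy_fisher$p.value

# co-occurrence frequency expected if there were no association
toy_expected <- (a + b) * (a + c) / (a + b + c + d)
toy_expected
```

Comparing the observed frequency (a) with the expected frequency shows the direction of the association: if a is larger than expected, the terms attract each other; if it is smaller, they repel each other.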

In a first step, we create a table which holds these quantities.

# convert to data frame
coocdf <- as.data.frame(as.matrix(collocates))
# reduce data
diag(coocdf) <- 0
coocdf <- coocdf[which(rowSums(coocdf) > 10),]
coocdf <- coocdf[, which(colSums(coocdf) > 10)]
# extract stats
cooctb <- coocdf %>%
  dplyr::mutate(Term = rownames(coocdf)) %>%
  tidyr::gather(CoocTerm, TermCoocFreq,
                colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]) %>%
  dplyr::mutate(Term = factor(Term),
                CoocTerm = factor(CoocTerm)) %>%
  dplyr::mutate(AllFreq = sum(TermCoocFreq)) %>%
  dplyr::group_by(Term) %>%
  dplyr::mutate(TermFreq = sum(TermCoocFreq)) %>%
  dplyr::ungroup(Term) %>%
  dplyr::group_by(CoocTerm) %>%
  dplyr::mutate(CoocFreq = sum(TermCoocFreq)) %>%
  dplyr::arrange(Term) %>%
  dplyr::mutate(a = TermCoocFreq,
                b = TermFreq - a,
                c = CoocFreq - a, 
                d = AllFreq - (a + b + c)) %>%
  dplyr::mutate(NRows = nrow(coocdf))

We now select the key term (selection). If we wanted to find all collocations that are present in the data, we would use the entire data rather than only the subset that contains selection.

cooctb_redux <- cooctb %>%
  dplyr::filter(Term == coocTerm)

Next, we calculate which terms are (significantly) over- and under-proportionately used with selection. It is important to note that this procedure informs about both: over- and under-use! This is especially crucial when analyzing if specific words are attracted or repelled by certain constructions. Of course, this approach is not restricted to analyses of constructions and it can easily be generalized across domains and has also been used in machine learning applications.

coocStatz <- cooctb_redux %>%
  dplyr::rowwise() %>%
  # p-value of Fisher's Exact test
  dplyr::mutate(p = fisher.test(matrix(c(a, b, c, d), ncol = 2, byrow = TRUE))$p.value) %>%
  # chi-squared statistic
  dplyr::mutate(x2 = as.vector(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = TRUE))$statistic)) %>%
  # effect size phi
  dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
  # expected co-occurrence frequency (first cell of the table)
  dplyr::mutate(expected = as.vector(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = TRUE))$expected[1])) %>%
  # TRUE is the fallback condition that catches all remaining cases
  dplyr::mutate(Significance = dplyr::case_when(p <= .001 ~ "p<.001",
                                                p <= .01 ~ "p<.01",
                                                p <= .05 ~ "p<.05", 
                                                TRUE ~ "n.s."))

We now add information to the table and remove superfluous columns so that the table can be more easily parsed.

coocStatz <- coocStatz %>%
  dplyr::ungroup() %>%
  dplyr::arrange(p) %>%
  dplyr::mutate(j = 1:n()) %>%
  # perform benjamini-hochberg correction
  dplyr::mutate(corr05 = ((j/NRows)*0.05)) %>%
  dplyr::mutate(corr01 = ((j/NRows)*0.01)) %>%
  dplyr::mutate(corr001 = ((j/NRows)*0.001)) %>%
  # calculate corrected significance status
  dplyr::mutate(CorrSignificance = dplyr::case_when(p <= corr001 ~ "p<.001",
                                                    p <= corr01 ~ "p<.01",
                                                    p <= corr05 ~ "p<.05", 
                                                    TRUE ~ "n.s.")) %>%
  dplyr::mutate(p = round(p, 6)) %>%
  dplyr::mutate(x2 = round(x2, 1)) %>%
  dplyr::mutate(phi = round(phi, 2)) %>%
  dplyr::arrange(p) %>%
  dplyr::select(-a, -b, -c, -d, -j, -NRows, -corr05, -corr01, -corr001) %>%
  dplyr::mutate(Type = ifelse(expected > TermCoocFreq, "Antitype", "Type"))

The results show that selection collocates significantly with selection (of course) but also, as expected, with natural. The corrected p-values show that after Benjamini-Hochberg correction for multiple/repeated testing (see Field, Miles, and Field 2012) these are the only significant collocates of selection. Corrections are necessary when performing multiple tests because otherwise, the reliability of the test result would be strongly impaired as repeated testing causes substantial \(\alpha\)-error inflation. The Benjamini-Hochberg correction that has been used here is preferable over the more popular Bonferroni correction because it is less conservative and therefore less likely to result in \(\beta\)-errors (see again Field, Miles, and Field 2012).
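The difference between the two corrections can be illustrated with the p.adjust function from base R, which offers a built-in alternative to calculating corrected thresholds by hand. The ten p-values below are made up for illustration and do not stem from the analysis above.

```r
# made-up p-values from ten hypothetical tests
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.042,
           0.060, 0.074, 0.205, 0.212, 0.216)

# adjust the p-values with both correction methods
p_bh   <- p.adjust(p_raw, method = "BH")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# number of tests that remain significant at the 5 percent level
sum(p_raw <= 0.05)
sum(p_bh <= 0.05)
sum(p_bonf <= 0.05)
```

With these values, the Benjamini-Hochberg correction retains more significant results than the Bonferroni correction, which reflects that it is the less conservative of the two.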

5 Changes in Collocation Strength

We now turn to analyses of changes in collocation strength over apparent time. The example focuses on adjective amplification in Australian English. The issue we will analyze here is whether we can unearth changes in the collocation pattern of adjective amplifiers such as very, really, or so. In other words, we will investigate if amplifiers associate with different adjectives among speakers from different age groups.

In a first step, we load the required function and the data.

# load functions
source("https://SLCLADAL.github.io/rscripts/collexcovar.R")
# load data
ampaus <- base::readRDS(url("https://slcladal.github.io/data/ozd.rda", "rb"))

The data consists of three variables (Adjective, Variant, and Age). In a next step, we perform a co-varying collexeme analysis for really versus all other amplifiers. The collexcovar function takes a data set consisting of three columns labeled keys, colls, and time, so we first rename the columns accordingly.

# rename data
ampaus <- ampaus %>%
  dplyr::rename(keys = Variant, colls = Adjective, time = Age)
# perform analysis
collexcovar_really <- collexcovar(data = ampaus, keyterm = "really")

Now that the data has the correct labels, we can continue with the co-varying collexeme analysis for the remaining amplifiers.

# perform analysis
collexcovar_pretty <- collexcovar(data = ampaus, keyterm = "pretty")
collexcovar_so <- collexcovar(data = ampaus, keyterm = "so")
collexcovar_very <- collexcovar(data = ampaus, keyterm = "very")

For the other amplifiers, we have to change the label other to bin as the function already has a label other. Once we have changed other to bin, we perform the analysis.

ampaus <- ampaus %>%
  dplyr::mutate(keys = ifelse(keys == "other", "bin", keys))
collexcovar_other <- collexcovar(data = ampaus, keyterm = "bin")

Next, we combine the results of the co-varying collexeme analysis into a single table.

# combine tables
collexcovar_ampaus <- rbind(collexcovar_really, collexcovar_very, 
                     collexcovar_so, collexcovar_pretty, collexcovar_other)
collexcovar_ampaus <- collexcovar_ampaus %>%
  dplyr::rename(Age = time,
                Adjective = colls) %>%
  dplyr::mutate(Variant = ifelse(Variant == "bin", "other", Variant)) %>%
  dplyr::arrange(Age)

We now modify the data set so that we can plot the collocation strength across apparent time.

ampauscoll <- collexcovar_ampaus %>%
  dplyr::select(Age, Adjective, Variant, Type, phi) %>%
  dplyr::mutate(phi = ifelse(Type == "Antitype", -phi, phi)) %>%
  dplyr::select(-Type) %>%
  tidyr::spread(Adjective, phi) %>%
  tidyr::replace_na(list(bad = 0,
                         funny = 0,
                         hard = 0,
                         good = 0,
                         nice = 0,
                         other = 0)) %>%
  tidyr::gather(Adjective, phi, bad:other) %>%
  tidyr::spread(Variant, phi) %>%
  tidyr::replace_na(list(pretty = 0,
                    really = 0,
                    so = 0,
                    very = 0,
                    other = 0)) %>%
  tidyr::gather(Variant, phi, other:very)

In a final step, we visualize the results of our analysis.

ggplot(ampauscoll, aes(x = reorder(Age, desc(Age)),
                       y = phi, group = Variant, 
                      color = Variant, linetype = Variant)) +
  facet_wrap(vars(Adjective)) +
  geom_line() +
  guides(color=guide_legend(override.aes=list(fill=NA))) +
  scale_color_manual(values = 
                       c("gray70", "gray70", "gray20", "gray70", "gray20"),
                        name="Variant",
                        breaks = c("other", "pretty", "really", "so", "very"), 
                        labels = c("other", "pretty", "really", "so", "very")) +
  scale_linetype_manual(values = 
                          c("dotted", "dotdash", "longdash", "dashed", "solid"),
                        name="Variant",
                        breaks = c("other", "pretty", "really", "so", "very"), 
                        labels = c("other",  "pretty", "really", "so", "very")) +
  theme_bw(base_size = 12) +
  theme(legend.position="top", 
        axis.text.x = element_text(size=12),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  coord_cartesian(ylim = c(-.2, .4)) +
  labs(x = "Age", y = "Collocation Strength") +
  guides(size = "none", alpha = "none")

The results show that the collocation strength of different amplifier variants changes quite notably across age groups, and we can also see that there is considerable variability in the way that the collocation strength changes. For example, the collocation strength between bad and really decreases from old to young speakers, while the reverse trend emerges for good, which means that really is collocating more strongly with good among younger speakers than it is among older speakers.

6 Collostructional Analysis

Collostructional analysis (Stefanowitsch and Gries 2003; Stefanowitsch and Gries 2005) investigates the lexicogrammatical associations between constructions and lexical elements and there exist three basic subtypes of collostructional analysis:

  • Simple Collexeme Analysis

  • Distinctive Collexeme Analysis

  • Co-Varying Collexeme Analysis

The analyses performed here are based on the collostructions package (Flach 2017).

Simple Collexeme Analysis

Simple Collexeme Analysis determines if a word is significantly attracted to a specific construction within a corpus. The idea is that the frequency of the word that is attracted to a construction is significantly higher within the construction than would be expected by chance.

The example here analyzes the Go + Verb construction (e.g. Go suck a nut!). The question is which verbs are attracted to this construction (in this case, whether suck is attracted to this construction).

To perform these analyses, we use the collostructions package. Information about how to download and install this package can be found here.


NOTE
Downloading and installing the collostructions package is a bit tricky and not really user friendly. To install this package, go to the website of Susanne Flach, who has written the collostructions package. On that website, you find different versions of that package for different operating systems (OS) like Windows and Mac. Next, download the version that is the right one for your OS, unzip the package file and copy it into your R library. Once you have copied the collostructions package into your R library, you can run the code chunks below to install and activate all the required packages.


Next, we install and activate the devtools package and then install the collostructions package.

# install devtools package
install.packages("devtools")
# load devtools package
library(devtools)
# install collostructions package
install_local(here::here("renv/library/R-4.2/x86_64-w64-mingw32/collostructions_0.2.0.zip"), repos=NULL, type="source")

We can now, finally, load the collostructions package.

# load collostructions package
library(collostructions)

Next, we inspect the data. We only use a random sample of 100 rows from the data set because the full output would be hard to read.

# draw a sample of the data
goVerb <- goVerb[sample(nrow(goVerb), 100),]
# inspect the first rows of the sample
head(goVerb)

The collex function, which calculates the results of a simple collexeme analysis, requires a data frame with three columns: column 1 contains the word to be tested, column 2 the frequency of the word in the construction (CXN.FREQ), and column 3 the frequency of the word in the corpus (CORP.FREQ).

To perform the simple collexeme analysis, we need the overall size of the corpus, the frequency with which a word occurs in the construction under investigation and the frequency of that construction.

# define corpus size
crpsiz <- sum(goVerb$CORP.FREQ)
# perform simple collexeme analysis
scollex_results <- collex(goVerb, corpsize = crpsiz, am = "logl", 
                          reverse = FALSE, decimals = 5,
                          threshold = 1, cxn.freq = NULL, 
                          str.dir = FALSE)

The results show which words are significantly attracted to the construction. If the ASSOC column did not show attr, then the word would be repelled by the construction.
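
To make the logic behind these results more concrete, the following sketch works through a single verb by hand. It is not the internal implementation of collex; it only illustrates, under stated assumptions, how the quantities listed above can be arranged in a 2x2 table and compared to chance expectations with a log-likelihood (G2) statistic. All counts and object names (word_cxn, word_corp, cxn_total, corp_size) are hypothetical and not taken from goVerb, and the label rep for repelled words is likewise an assumption.

```r
# hypothetical counts for one verb (illustrative values, not taken from goVerb)
word_cxn  <- 40       # frequency of the verb in the construction
word_corp <- 2000     # frequency of the verb in the corpus
cxn_total <- 500      # frequency of the construction
corp_size <- 1000000  # size of the corpus

# observed 2x2 table: construction vs. rest of corpus, verb vs. other words
obs <- matrix(
  c(word_cxn,             cxn_total - word_cxn,
    word_corp - word_cxn, corp_size - cxn_total - word_corp + word_cxn),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("construction", "elsewhere"), c("verb", "other"))
)

# frequencies expected by chance, derived from the row and column totals
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# log-likelihood statistic (this simple form assumes no empty cells) and p-value
g2 <- 2 * sum(obs * log(obs / expd))
p  <- pchisq(g2, df = 1, lower.tail = FALSE)

# direction of the association: attracted if observed exceeds expected
# (the label "rep" is an assumed name for the repelled case)
assoc <- ifelse(obs[1, 1] > expd[1, 1], "attr", "rep")

# inspect observed vs. expected frequency, the statistic, and the direction
round(c(observed = obs[1, 1], expected = expd[1, 1], G2 = g2), 2)
assoc
```

With these made-up counts, the verb occurs 40 times in the construction although only 1 occurrence would be expected by chance, so it counts as attracted.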

Covarying Collexeme Analysis

Covarying collexeme analysis determines if the occurrence of a word in the first slot of a construction correlates with the occurrence of a word in the second slot of the construction. As such, covarying collexeme analysis analyzes constructions with two slots and how the lexical elements within the two slots affect each other.

The data we will use is called vsmdata and consists of 5,000 observations of adjectives together with information on whether each adjective was amplified. As such, vsmdata contains two columns: one column with the adjectives (Adjectives) and another column (Amplifier) indicating whether the adjective was amplified (0 means that the adjective occurred without an amplifier). The first six rows of the data are shown below.

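# load data and recode the amplifier column as binary (0 = not amplified, 1 = amplified)
vsmdata <- base::readRDS(url("https://slcladal.github.io/data/vsd.rda", "rb")) %>%
  dplyr::mutate(Amplifier = ifelse(Amplifier == 0, 0, 1))
# inspect the first six rows
head(vsmdata)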

We now perform the covarying collexeme analysis and inspect the results.

covar_results <- collex.covar(vsmdata)

The results show whether the words in the first and second slots attract or repel each other (ASSOC) and provide uncorrected significance levels.
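
Because the significance levels are uncorrected and one test is run per word pair, a correction for multiple testing may be advisable. The sketch below illustrates the idea on a small, made-up two-slot data set: each attested pair is tested with Fisher's exact test on a 2x2 table, and the resulting p-values are adjusted with the Holm method via p.adjust. The data, the helper pair_test, and the choice of Fisher's exact test are assumptions for illustration; this is not how collex.covar computes its output.

```r
# hypothetical two-slot data (illustrative, not vsmdata)
pairs <- data.frame(
  slot1 = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "a"),
  slot2 = c("x", "x", "y", "y", "y", "x", "y", "y", "x", "x"),
  stringsAsFactors = FALSE
)

# hypothetical helper: Fisher's exact test for one slot1-slot2 pair
pair_test <- function(w1, w2, data) {
  in1 <- factor(data$slot1 == w1, levels = c(TRUE, FALSE))
  in2 <- factor(data$slot2 == w2, levels = c(TRUE, FALSE))
  # 2x2 table: w1 vs. other slot-1 words, w2 vs. other slot-2 words
  fisher.test(table(in1, in2))$p.value
}

# test every attested pair once
combos <- unique(pairs)
combos$p <- mapply(pair_test, combos$slot1, combos$slot2,
                   MoreArgs = list(data = pairs), USE.NAMES = FALSE)

# Holm correction for multiple testing
combos$p_holm <- p.adjust(combos$p, method = "holm")
combos
```

Adjusted p-values are never smaller than the raw ones, so pairs that remain significant after correction provide more conservative evidence for an association.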

Distinctive Collexeme Analysis

Distinctive Collexeme Analysis determines if the frequencies of items in two alternating constructions or under two conditions differ significantly. This analysis can be extended to analyze if the use of a word differs between two corpora.

Again, we use the vsmdata data.

collexdist_results <- collex.dist(vsmdata, raw = TRUE)

The results show whether words are significantly attracted to or repelled by a modifier variant.
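
The comparison underlying a distinctive collexeme analysis can be sketched for a single word as follows. The counts and names (word_a, total_a, word_b, total_b) are hypothetical, and the chi-squared test is used here only as a simple stand-in for whichever association measure collex.dist applies.

```r
# hypothetical counts (illustrative values, not taken from vsmdata)
word_a  <- 30   # occurrences of the word in variant A
total_a <- 400  # all tokens of variant A
word_b  <- 10   # occurrences of the word in variant B
total_b <- 600  # all tokens of variant B

# 2x2 table: variant A vs. variant B, target word vs. all other words
tab <- matrix(
  c(word_a, total_a - word_a,
    word_b, total_b - word_b),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("variant A", "variant B"), c("word", "other words"))
)

# chi-squared test without continuity correction; $expected holds the
# frequencies expected if the word were distributed evenly across variants
res <- chisq.test(tab, correct = FALSE)
res$expected

# the word is distinctive for the variant in which it is over-represented
preferred <- ifelse(word_a > res$expected[1, 1], "variant A", "variant B")
preferred
res$p.value
```

Here the word is expected 16 times in variant A but observed 30 times, so it is distinctive for variant A.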

Citation & Session Info

Schweinberger, Martin. 2022. Analyzing Co-Occurrences and Collocations in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/coll.html (Version 2022.08.16).

@manual{schweinberger2022coll,
  author = {Schweinberger, Martin},
  title = {Analyzing Co-Occurrences and Collocations in R},
  note = {https://slcladal.github.io/coll.html},
  year = {2022},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.08.16}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] collostructions_0.2.0 slam_0.1-50           tm_0.7-8             
##  [4] NLP_0.2-1             stringr_1.4.0         dplyr_1.0.9          
##  [7] quanteda_3.2.1        Matrix_1.4-1          network_1.17.2       
## [10] igraph_1.3.1          ggdendro_0.1.23       GGally_2.1.2         
## [13] flextable_0.7.0       factoextra_1.0.7      ggplot2_3.3.6        
## [16] FactoMineR_2.4       
## 
## loaded via a namespace (and not attached):
##  [1] RColorBrewer_1.1-3      SnowballC_0.7.0         backports_1.4.1        
##  [4] tools_4.2.1             bslib_0.3.1             utf8_1.2.2             
##  [7] R6_2.5.1                DT_0.23                 DBI_1.1.2              
## [10] colorspace_2.0-3        withr_2.5.0             tidyselect_1.1.2       
## [13] compiler_4.2.1          cli_3.3.0               flashClust_1.01-2      
## [16] xml2_1.3.3              officer_0.4.2           labeling_0.4.2         
## [19] sass_0.4.1              scales_1.2.0            systemfonts_1.0.4      
## [22] digest_0.6.29           rmarkdown_2.14          base64enc_0.1-3        
## [25] pkgconfig_2.0.3         htmltools_0.5.2         fastmap_1.1.0          
## [28] highr_0.9               htmlwidgets_1.5.4       rlang_1.0.2            
## [31] rstudioapi_0.13         jquerylib_0.1.4         generics_0.1.2         
## [34] farver_2.1.0            jsonlite_1.8.0          statnet.common_4.6.0   
## [37] car_3.0-13              zip_2.2.0               tokenizers_0.2.1       
## [40] magrittr_2.0.3          leaps_3.1               Rcpp_1.0.8.3           
## [43] munsell_0.5.0           fansi_1.0.3             abind_1.4-5            
## [46] gdtools_0.2.4           lifecycle_1.0.1         scatterplot3d_0.3-41   
## [49] stringi_1.7.6           yaml_2.3.5              carData_3.0-5          
## [52] MASS_7.3-57             plyr_1.8.7              grid_4.2.1             
## [55] parallel_4.2.1          ggrepel_0.9.1           crayon_1.5.1           
## [58] lattice_0.20-45         quanteda.textstats_0.95 sna_2.6                
## [61] knitr_1.39              klippy_0.0.0.9500       pillar_1.7.0           
## [64] ggpubr_0.4.0            uuid_1.1-0              ggsignif_0.6.3         
## [67] stopwords_2.3           fastmatch_1.1-3         glue_1.6.2             
## [70] evaluate_0.15           tidytext_0.3.3          data.table_1.14.2      
## [73] renv_0.15.4             RcppParallel_5.1.5      vctrs_0.4.1            
## [76] tidyr_1.2.0             gtable_0.3.0            purrr_0.3.4            
## [79] reshape_0.8.9           assertthat_0.2.1        xfun_0.30              
## [82] broom_0.8.0             rstatix_0.7.0           coda_0.19-4            
## [85] janeaustenr_0.1.5       tibble_3.1.7            nsyllable_1.0.1        
## [88] cluster_2.1.3           ellipsis_0.3.2


References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.
Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.
Flach, Susanne. 2017. “Collostructions: An R Implementation for the Family of Collostructional Methods.” Package version v.0.1.0. https://sfla.ch/collostructions/.
Stefanowitsch, Anatol, and Stefan Th. Gries. 2005. “Covarying Collexemes.” Corpus Linguistics and Linguistic Theory 1 (1): 1–43.
Stefanowitsch, Anatol, and Stefan Th. Gries. 2003. “Collostructions: Investigating the Interaction of Words and Constructions.” International Journal of Corpus Linguistics 8 (2): 209–43.
Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH), Berlin, Germany, September 12, 2017, 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.