This tutorial introduces keyness and keyword analysis — a set of corpus-linguistic methods for identifying words that are statistically characteristic of one text or corpus when compared to another. Keywords play a pivotal role in text analysis, serving as distinctive terms that hold particular significance within a given text, context, or collection. These words stand out due to their heightened frequency in a specific text or context, setting them apart from their occurrence in another. In essence, keywords are linguistic markers that encapsulate the essence or topical focus of a document or dataset. The process of identifying keywords is akin to the one employed for detecting collocations using kwics: we compare the use of a particular word in a target corpus A against its use in a reference corpus B. By discerning the frequency disparities, we gain valuable insights into the salient terms that contribute significantly to the unique character and thematic emphasis of a given text or context.
This tutorial is aimed at beginners and intermediate users of R and showcases how to extract keywords from, and analyse keywords in, textual data. It does not provide a fully-fledged analysis but rather demonstrates selected useful methods associated with keyness and keyword analysis.
Prerequisite Tutorials
To be able to follow this tutorial, we suggest you check out and familiarise yourself with the content of the following R Basics tutorials:
Familiarity with basic frequency analysis and with the concept of statistical significance testing will be particularly helpful for understanding the keyness statistics introduced in this tutorial.
Learning Objectives
By the end of this tutorial you will be able to:
Explain what a keyword is and how keyness analysis differs from simple frequency analysis
Describe the dimensions of keyness proposed by Egbert and Biber (2019) and Sønning (2023) — frequency vs. dispersion, and target-intrinsic vs. comparative
Construct the 2×2 contingency table that underlies all keyness statistics
Compute a comprehensive suite of keyness measures in R — G², χ², phi, MI, PMI, Log Odds Ratio, Rate Ratio, Rate Difference, Difference Coefficient, Odds Ratio, DeltaP, and Signed DKL
Apply Fisher’s Exact Test and Bonferroni correction to assess and control for statistical significance
Visualise keyword results using dot plots, bar plots, and comparison word clouds
Interpret types (overrepresented words) and antitypes (underrepresented words) substantively
Report keyword analyses in accordance with best-practice conventions in corpus linguistics
What This Tutorial Covers
Dimensions of keyness — frequency vs. dispersion, discernibility vs. distinctiveness, target-intrinsic vs. comparative approaches
The 2×2 contingency table — the logical and mathematical foundation of keyword identification
Significance testing — Fisher’s Exact Test and Bonferroni correction for multiple comparisons
Visualising keywords — dot plots, bar plots, and comparison word clouds
Reporting standards — what to report, model paragraphs, and a quick-reference checklist
Preparation and Session Set-up
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial we need to install certain packages so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code in this section. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code; installation may take some time (between 1 and 5 minutes for all of the libraries), so do not worry if it takes a while.
Code
# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print = 1000)
# install packages
install.packages("checkdown")
install.packages("flextable")
install.packages("Matrix")
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("tm")
install.packages("ggplot2")
Next, we load the packages.
Code
# load packages
library(checkdown)          # interactive quiz questions
library(flextable)          # formatted tables
library(Matrix)             # sparse matrix support
library(quanteda)           # corpus and tokenisation tools
library(quanteda.textplots) # word clouds and text visualisations
library(dplyr)              # data manipulation
library(stringr)            # string processing
library(tidyr)              # data reshaping
library(tm)                 # stopword lists
library(ggplot2)            # data visualisation
Interactive Keyword Tool
KEYWORD TOOL
Click here to open a notebook-based tool that calculates keyness statistics and allows you to download the results.
How can you detect keywords — words that are characteristic of a text or a collection of texts?
This tutorial aims to show how you can answer this question.
Keywords
Section Overview
What you’ll learn: What keywords are, why they matter, and how keyword identification relates to frequency analysis
Why it matters: Understanding the logic of keyness is essential before computing any statistics — knowing what a keyword is tells you how to choose the right measure and how to interpret the results
Keywords play a central role in corpus linguistics and computational text analysis. In everyday language, the word keyword may mean simply an important or central word in a document. In corpus linguistics, however, the term has a more precise, comparative meaning: a keyword is a word whose frequency — or whose distribution — in a target corpus is statistically unusual compared to a reference corpus (Scott 1997; Stubbs 2010).
This comparative logic is fundamental. Consider the word whale: it will be extremely frequent in a corpus of whaling narratives (such as Melville’s Moby Dick) but far less common in dystopian fiction. Its relative excess in the whaling corpus is what makes it a keyword there — not its raw frequency per se, but its frequency relative to a baseline. The reference corpus serves as that baseline, providing an estimate of how often we would expect a given word to appear in text generally, against which we assess whether its occurrence in the target corpus is surprising.
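This baseline logic amounts to comparing normalised rates rather than raw counts. A quick sketch in R (all counts here are invented for illustration):

```r
# Illustrative (made-up) counts: 'whale' in a whaling corpus vs. a dystopian corpus
whale_target    <- 1200    # occurrences of 'whale' in the target corpus
target_size     <- 210000  # tokens in the target corpus
whale_reference <- 4       # occurrences of 'whale' in the reference corpus
reference_size  <- 100000  # tokens in the reference corpus

# normalised rates per 1,000 words: these, not the raw counts, are comparable
rate_target    <- whale_target / target_size * 1000
rate_reference <- whale_reference / reference_size * 1000
round(c(target = rate_target, reference = rate_reference), 2)
```

Even though the two corpora differ in size, the per-thousand-word rates make the disparity directly comparable; keyness statistics quantify how surprising such a disparity is.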
The identification of keywords is used across a wide range of applications in linguistics and beyond, including:
Stylistic analysis — characterising an author’s distinctive vocabulary relative to contemporaries or a general corpus
Genre analysis — identifying what makes a genre lexically distinctive
Diachronic studies — tracking which words become more or less characteristic of a variety over time
Discourse analysis — revealing vocabulary associated with a particular social group or ideological position
Language pedagogy — identifying vocabulary that is key to a specific academic field or register
The Reference Corpus Matters
The reference corpus is not a neutral backdrop — it shapes every keyword that emerges from the analysis. A study comparing academic writing to news prose will produce very different keywords than one comparing the same academic texts to spoken conversation. Always report what your reference corpus is, justify why it is the appropriate baseline for your research question, and interpret all keywords in light of that choice.
Dimensions of Keyness
Section Overview
What you’ll learn: The theoretical framework for understanding different types of keyness — frequency-based vs. dispersion-based, and target-intrinsic vs. comparative
Key references: Egbert and Biber (2019); Sønning (2023)
Why it matters: Not all keyness measures capture the same property of language. Understanding the dimensions of keyness helps you choose the measure that best reflects your research question.
Before turning to the practicalities of computing keyness, it is worth considering what typicalness — the theoretical goal of keyness analysis — actually means. This question has received renewed attention in recent methodological work (Sønning 2023).
Keyness analysis identifies typical items in a discourse domain, where typicalness traditionally relates to frequency of occurrence: the emphasis is on items used more frequently in the target corpus compared to a reference corpus. Egbert and Biber (2019) expanded this notion by highlighting two distinct criteria for typicalness: content-distinctiveness and content-generalizability.
Content-distinctiveness refers to an item’s association with the domain and its topical relevance — how much more (or less) it is used in the target than in a reference corpus.
Content-generalizability pertains to an item’s widespread usage across various texts within the target domain — whether the word surfaces broadly or is concentrated in just a handful of documents.
These criteria bridge traditional keyness approaches with broader linguistic perspectives, emphasising both the distinctiveness and the generalizability of key items within a corpus.
Following Sønning (2023), we can adopt the keyness criteria of Egbert and Biber (2019) and distinguish between frequency-oriented and dispersion-oriented approaches to assessing keyness. We can also distinguish between keyness features that are assessed relative to the target variety only (target-intrinsic) and those that emerge only from a comparison to a reference variety (comparative). This four-way classification, detailed in the table below, links methodological choices to the linguistic meaning conveyed by quantitative measures:
| Analysis | Frequency-oriented | Dispersion-oriented |
| Target variety in isolation | Discernibility of item in the target variety | Generality across texts in the target variety |
| Comparison to reference variety | Distinctiveness relative to the reference variety | Comparative generality relative to the reference variety |
The second key aspect of keyness involves an item’s dispersion across texts in the target domain, indicating its widespread use. A typical item should appear evenly across various texts within the target domain, reflecting its generality. This breadth of usage can be compared to its occurrence in the reference domain — termed comparative generality. Therefore, a key item should exhibit greater prevalence across target texts compared to those in the reference domain.
In this tutorial we focus primarily on the frequency-comparative quadrant: identifying words that are significantly more (or less) frequent in the target corpus than in the reference corpus. This is by far the most commonly implemented approach in corpus-linguistic research and the one found in tools such as AntConc, WordSmith Tools, and Sketch Engine. Dispersion-based approaches are an important complementary perspective but are beyond the scope of this introductory tutorial.
Exercises: Dimensions of Keyness
Q1. A word appears 800 times in a target corpus of 200,000 tokens, but it also appears very frequently in the reference corpus in proportion to its size. Is this word necessarily a keyword?
Q2. What is the difference between content-distinctiveness and content-generalizability as described by Egbert & Biber (2019)?
Identifying Keywords
Section Overview
What you’ll learn: The logical and mathematical structure of keyword identification — how the 2×2 contingency table works and what information it captures
Why it matters: Every keyness statistic — from G² to MI to the Log Odds Ratio — is computed from this same table. Understanding it is the key to understanding all measures.
Here, we focus on a frequency-based approach that assesses distinctiveness relative to the reference variety. To identify these keywords, we follow the procedure used to identify collocations using kwics — the idea is essentially identical: we compare the use of a word in a target corpus A to its use in a reference corpus B.
To determine if a token is a keyword — whether it occurs significantly more frequently in a target corpus compared to a reference corpus — we use the following information arranged in a 2×2 contingency table:
O11 = Number of times wordx occurs in the target corpus
O12 = Number of times wordx occurs in the reference corpus (without target corpus)
O21 = Number of times other words occur in the target corpus
O22 = Number of times other words occur in the reference corpus
|              | Target corpus | Reference corpus | Row total |
| token        | O11           | O12              | = R1      |
| other tokens | O21           | O22              | = R2      |
| Column total | = C1          | = C2             | = N       |
From these observed counts we compute expected frequencies — the counts we would expect if wordx were distributed in exact proportion to the sizes of the two corpora (i.e., the null hypothesis of no keyness):
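Each expected frequency is the product of the corresponding row and column totals divided by the grand total; for the top-left cell, for example:

\[E_{11} = \frac{R_1 \times C_1}{N}\]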
If the observed O11 substantially exceeds E11, the word appears more often in the target than chance would predict: it is a candidate keyword, also called a type. If O11 is substantially below E11, the word is underrepresented in the target: it is an antitype — a keyword of the reference corpus.
Types and Antitypes
Both directions of keyness are substantively informative:
A type is a word used significantly more in the target corpus than expected — it characterises the target.
An antitype is a word used significantly less in the target corpus than expected — it characterises the reference corpus, or equivalently, is avoided in the target.
Antitypes can reveal what a text or genre systematically avoids saying, which is often as theoretically meaningful as what it uses abundantly. For example, if we compare political speeches to news reporting, words significantly avoided in speeches (antitypes) can illuminate strategic communicative choices.
Data: Two Literary Texts
We begin by loading two texts. text1 is our target and text2 is our reference.
We inspect the first 200 characters of each text to confirm what we are working with:
1984 George Orwell Part 1, Chapter 1 It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, sli
As you can see, text1 is George Orwell’s Nineteen Eighty-Four.
MOBY-DICK; or, THE WHALE. By Herman Melville CHAPTER 1. Loomings. Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interes
The excerpt shows that text2 is Herman Melville’s Moby Dick. These two novels are chosen because they are stylistically and thematically very different — one a mid-twentieth-century dystopian political novel, the other a nineteenth-century nautical adventure — which produces clear and interpretable keywords, making them ideal for illustrative purposes.
Computing Keyness Statistics
Section Overview
What you’ll learn: How to tokenise two texts, build frequency and contingency tables, and calculate a comprehensive suite of keyness measures in R — step by step
Why it matters: Building the analysis from scratch means you understand exactly what each step does and can adapt it to your own corpora and research questions
After loading the two texts, we create a frequency table of the first text (the target).
Code
text1_words <- text1 %>%
  # remove non-word characters
  stringr::str_remove_all("[^[:alpha:] ]") %>%
  # convert to lower case
  tolower() %>%
  # tokenize the corpus files
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  # unlist the tokens to create a data frame
  unlist() %>%
  as.data.frame() %>%
  # rename the column to 'token'
  dplyr::rename(token = 1) %>%
  # group by 'token' and count occurrences
  dplyr::group_by(token) %>%
  dplyr::summarise(n = n()) %>%
  # add column stating where the frequency list is 'from'
  dplyr::mutate(type = "text1")
Now, we create a frequency table for the second text (the reference).
Code
text2_words <- text2 %>%
  # remove non-word characters
  stringr::str_remove_all("[^[:alpha:] ]") %>%
  # convert to lower case
  tolower() %>%
  # tokenize the corpus files
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  # unlist the tokens to create a data frame
  unlist() %>%
  as.data.frame() %>%
  # rename the column to 'token'
  dplyr::rename(token = 1) %>%
  # group by 'token' and count occurrences
  dplyr::group_by(token) %>%
  dplyr::summarise(n = n()) %>%
  # add column stating where the frequency list is 'from'
  dplyr::mutate(type = "text2")
In the next step, we combine the two frequency tables. We use a left join so that every word from the target corpus appears in the combined table, with a zero count assigned to words that do not appear in the reference corpus.
Code
texts_df <- dplyr::left_join(text1_words, text2_words, by = c("token")) %>%
  # rename columns and select relevant columns
  dplyr::rename(
    text1 = n.x,
    text2 = n.y
  ) %>%
  dplyr::select(-type.x, -type.y) %>%
  # replace NA values with 0
  tidyr::replace_na(list(text1 = 0, text2 = 0))
| token       | text1 | text2 |
| a           | 2,390 | 4,536 |
| aaronson    | 8     | 0     |
| aback       | 2     | 2     |
| abandon     | 3     | 3     |
| abandoned   | 4     | 7     |
| abashed     | 1     | 2     |
| abbreviated | 1     | 0     |
| abiding     | 1     | 1     |
| ability     | 1     | 1     |
| abject      | 3     | 0     |
We now calculate the observed and expected frequencies as well as the row and column totals needed to fill the 2×2 contingency table for each word.
The table above shows the keywords for text1, which is George Orwell’s Nineteen Eighty-Four. The table starts with token (word type), followed by type, which indicates whether the token is a keyword in the target data (type) or a keyword in the reference data (antitype). Next is the Bonferroni-corrected significance (Sig_corrected), which accounts for repeated testing. This is followed by O11 (observed frequency of the token in the target corpus), and then by the various keyness statistics, which are explained in detail in the next section.
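As a minimal, self-contained sketch of this calculation step (using a tiny made-up frequency table in place of the texts_df object built above; assoc_df and the toy counts are hypothetical):

```r
library(dplyr)

# toy frequency table standing in for texts_df (counts are invented)
texts_df <- data.frame(
  token = c("whale", "party", "the"),
  text1 = c(1, 80, 500),   # counts in the target corpus
  text2 = c(900, 0, 700)   # counts in the reference corpus
)

C1 <- sum(texts_df$text1)  # target corpus size (column total)
C2 <- sum(texts_df$text2)  # reference corpus size (column total)
N  <- C1 + C2              # grand total

assoc_df <- texts_df %>%
  dplyr::mutate(
    # observed frequencies (2x2 cells for each word)
    O11 = text1,
    O12 = text2,
    O21 = C1 - O11,
    O22 = C2 - O12,
    # expected frequencies under the null hypothesis of no keyness
    E11 = (O11 + O12) * C1 / N,
    E12 = (O11 + O12) * C2 / N,
    E21 = (O21 + O22) * C1 / N,
    E22 = (O21 + O22) * C2 / N,
    # log-likelihood statistic (cells with a zero count contribute 0)
    G2 = 2 * (ifelse(O11 > 0, O11 * log(O11 / E11), 0) +
              ifelse(O12 > 0, O12 * log(O12 / E12), 0) +
              ifelse(O21 > 0, O21 * log(O21 / E21), 0) +
              ifelse(O22 > 0, O22 * log(O22 / E22), 0)),
    # overrepresented in the target = type; underrepresented = antitype
    type = ifelse(O11 > E11, "type", "antitype")
  )
```

In this toy example, party comes out as a type (overused in the target) and whale as an antitype, mirroring the Orwell-vs-Melville contrast.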
Exercises: Computing Keyness
Q1. In the keyword contingency table, what does O11 represent?
Q2. Why is a small offset (e.g., +0.1) added to zero-count cells before calculating keyness statistics?
Q3. What does it mean for a word to be an antitype in a keyword analysis?
Keyness Measures Explained
Section Overview
What you’ll learn: What each keyness statistic measures conceptually, its mathematical formula, and when it is most appropriate to use
Why it matters: Different keyness measures capture different aspects of the relationship between a word and a corpus. Knowing what each one does allows you to make principled choices and report results accurately.
This section explains each of the statistics produced by the code above. Understanding these measures allows you to choose the most appropriate one for your research question and to interpret results correctly. These measures help analyse the association strength and significance of a token’s attraction to the target rather than the reference corpus.
Delta P (ΔP)
Delta P is a measure of association that indicates the difference in conditional probabilities. It measures the strength and direction of the association between a word and corpus membership:
\[\Delta P(A|B) = P(A|B) - P(A|\neg B) \quad \Rightarrow \quad \Delta P = \frac{O_{11}}{R_1} - \frac{O_{21}}{R_2}\]
Where \(P(A|B)\) is the probability of A given B, and \(P(A|\neg B)\) is the probability of A given not-B. Delta P ranges from −1 to +1 and is increasingly recommended in corpus-linguistic work (Gries 2013).
Log Odds Ratio
The Log Odds Ratio measures the strength of association between a word and the target corpus. It is the natural logarithm of the odds ratio and provides a symmetric measure. The +0.5 offsets (Haldane–Anscombe correction) handle zero-count cells:
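With the correction applied to every cell, the measure can be written as:

\[\text{Log Odds Ratio} = \ln\!\left(\frac{(O_{11} + 0.5)\,(O_{22} + 0.5)}{(O_{12} + 0.5)\,(O_{21} + 0.5)}\right)\]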
Positive values indicate overrepresentation in the target; negative values indicate underrepresentation. The Log Odds Ratio is particularly attractive because it is symmetric, interpretable as an effect size, and amenable to confidence interval construction.
Mutual Information (MI)
Mutual Information quantifies the amount of information obtained about corpus membership through knowing the word. It measures mutual dependence between the word and the corpus:
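A common corpus-linguistic implementation computes MI from the observed and expected target frequency (this pointwise form is an assumption about the implementation here, consistent with the low-frequency sensitivity described below):

\[\text{MI} = \log_2\!\left(\frac{O_{11}}{E_{11}}\right)\]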
MI is highly sensitive to low-frequency items: a word appearing only once or twice in the target but never in the reference will receive an extremely high MI score. It therefore tends to favour rare, highly specific words over more general but robustly frequent keywords. Use MI with a minimum frequency filter.
Pointwise Mutual Information (PMI)
Pointwise Mutual Information measures the association between the specific word and the target corpus as point-events:
Like MI, PMI is sensitive to low-frequency words, though in slightly different ways depending on the implementation. Both MI and PMI are better used as ranking or ordering metrics than as standalone significance tests.
Phi (φ) Coefficient
The phi coefficient is a scale-free effect size for the association between a word and corpus membership:
\[\phi = \sqrt{\frac{\chi^2}{N}}\]
Where \(\chi^2\) is Pearson’s chi-square statistic and \(N\) is the total number of tokens across both corpora. Phi ranges from 0 (no association) to 1 (perfect association), and is signed here to indicate direction (positive = type, negative = antitype). Because phi is normalised for sample size, it is valuable for comparing keyness strength across words or studies.
Chi-Square (χ²)
Pearson’s chi-square tests the independence of the word’s distribution from corpus membership:
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
It shares the same distributional logic as G² but is less robust when expected cell frequencies fall below 5 — which is common for rare words in large corpora. For most corpus-linguistic keyness applications, G² is preferred over χ².
Likelihood Ratio (G²)
The log-likelihood ratio statistic (G²) is the most widely recommended keyness measure in corpus linguistics (Dunning 1993). It compares how much better the data fit a model where the word has different rates in the two corpora versus a model assuming a single pooled rate:
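In terms of the observed and expected cell counts (cells with \(O_{ij} = 0\) contribute zero):

\[G^2 = 2 \sum_{ij} O_{ij}\,\ln\!\left(\frac{O_{ij}}{E_{ij}}\right)\]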
G² follows an approximate chi-square distribution, making significance assessment straightforward. Unlike Pearson’s χ², G² performs well even when expected cell frequencies are low, making it more robust for rare words.
Rate Ratio
The Rate Ratio compares the rate of events between two groups — here, the per-thousand-word frequencies in the target and reference corpora:
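Writing \(C_1\) and \(C_2\) for the two corpus sizes, the per-thousand scaling cancels and the measure reduces to a ratio of proportions (in practice a small offset is added to the reference rate to avoid division by zero):

\[\text{Rate Ratio} = \frac{O_{11}/C_1}{O_{12}/C_2}\]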
A Rate Ratio of 3.0 means the word appears three times more frequently per thousand words in the target than in the reference. It is intuitive and easy to communicate to non-specialist audiences. A small offset (+0.001) avoids division by zero for words absent from the reference.
Rate Difference
The Rate Difference measures the absolute difference in per-thousand-word event rates:
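A plausible formulation consistent with the description above, using per-thousand-word rates:

\[\text{Rate Difference} = 1000 \times \left(\frac{O_{11}}{C_1} - \frac{O_{12}}{C_2}\right)\]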
While the Rate Ratio is relative (multiplicative), the Rate Difference is absolute (additive). Both capture useful but distinct aspects of how usage rates differ.
Difference Coefficient
The Difference Coefficient (also known as the Difference Score) normalises the Rate Difference by the sum of the two rates:
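In terms of the per-thousand rates in the target (\(r_T\)) and reference (\(r_R\)):

\[\text{Difference Coefficient} = \frac{r_T - r_R}{r_T + r_R}\]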
It is bounded between −1 and +1: positive values indicate overrepresentation in the target and negative values indicate underrepresentation.
Odds Ratio
The Odds Ratio compares the odds of observing the word in the target corpus to the odds of observing it in the reference corpus. Values above 1 indicate overrepresentation in the target; values below 1 indicate underrepresentation. The log transformation (Log Odds Ratio, above) is usually preferred because it is symmetric around zero.
Log-Likelihood Ratio (LLR)
The LLR as implemented here is a simplified form that focuses on the target word’s contribution to the full G² statistic:
It is signed to indicate direction (positive = more frequent in target; negative = more frequent in reference).
Significance and Multiple Testing
All keyness statistics above measure association strength, but to determine whether a keyword is statistically significant we need a hypothesis test. The code uses Fisher’s Exact Test, which computes the exact probability of observing a contingency table as extreme as the one observed under the null hypothesis of no association. This is more reliable than the asymptotic chi-square approximation, especially for words with small expected counts.
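For a single word, this can be sketched with base R’s fisher.test() (the cell counts below are illustrative, not taken from the tutorial data):

```r
# 2x2 contingency table for one hypothetical word:
#   row 1: O11, O12 (word in target / reference)
#   row 2: O21, O22 (all other words in target / reference)
tab <- matrix(c(80,    5,
                19920, 29995),
              nrow = 2, byrow = TRUE)

ft <- stats::fisher.test(tab)
ft$p.value  # exact two-sided probability under H0 of no association
```

Because the test is exact, it remains valid even when expected cell frequencies fall far below the usual chi-square rules of thumb.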
Bonferroni Correction for Multiple Testing
When testing thousands of words simultaneously, some will appear significant purely by chance. If we test 10,000 words at α = .05, we expect roughly 500 false positives even if no word is truly a keyword. The Bonferroni correction addresses this by dividing the significance threshold by the number of tests performed: αcorrected = α / k, where k is the number of word types tested.
In the output table, the corrected significance tiers are:
| Label       | Meaning                                                               |
| p < .001*** | p ≤ .001 / k — very strong evidence against H₀                        |
| p < .01**   | p ≤ .01 / k                                                           |
| p < .05*    | p ≤ .05 / k                                                           |
| n.s.        | Not significant after Bonferroni correction — excluded from results   |
The Bonferroni correction is conservative (it increases the risk of false negatives alongside reducing false positives). An alternative that controls the False Discovery Rate (FDR) rather than the family-wise error rate is the Benjamini–Hochberg procedure, which offers more statistical power at the cost of allowing a small proportion of false positives among the significant results.
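Both corrections are available in base R via p.adjust(); the p-values below are illustrative:

```r
# five illustrative raw p-values from a keyword analysis
p_raw <- c(0.00001, 0.0004, 0.012, 0.03, 0.2)

# family-wise error rate control (conservative)
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")

# false discovery rate control (more powerful)
p_bh <- p.adjust(p_raw, method = "BH")

sum(p_bonferroni < 0.05)  # keywords surviving Bonferroni
sum(p_bh < 0.05)          # keywords surviving Benjamini-Hochberg
```

With these inputs, Bonferroni retains two of the five candidates while Benjamini–Hochberg retains four, illustrating the difference in statistical power.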
Exercises: Keyness Measures
Q1. Why might Mutual Information (MI) not be the best default measure for identifying keywords in a large corpus?
Q2. G² = 45.3 (p < .001, Bonferroni-corrected). What does this tell us?
Q3. A Rate Ratio of 0.15 for a word in a keyword analysis of text1 vs. text2 means:
Visualising Keywords
Section Overview
What you’ll learn: How to create and interpret three complementary visualisations of keyword results — dot plots, bar plots, and comparison word clouds
Why visualisation matters: A table with thousands of rows of keyness statistics is difficult to scan; visualisations make patterns immediately communicable and allow you to identify the most important results at a glance
Dot Plot
We can visualise the keyness strengths in a dot plot as shown in the code below. Sorting by G² in descending order and selecting the top 20 types gives us the words most strongly characteristic of Orwell’s Nineteen Eighty-Four.
Code
assoc_tb3 %>%
  dplyr::filter(type == "type") %>%
  dplyr::arrange(-G2) %>%
  head(20) %>%
  ggplot(aes(x = reorder(token, G2, mean), y = G2)) +
  geom_point(color = "steelblue", size = 3) +
  geom_segment(aes(xend = token, y = 0, yend = G2),
               color = "steelblue", linewidth = 0.7) +
  coord_flip() +
  theme_bw() +
  theme(panel.grid.minor = element_blank()) +
  labs(
    title = "Top 20 keywords of Orwell's Nineteen Eighty-Four",
    subtitle = "Compared to Melville's Moby Dick | sorted by G² (log-likelihood)",
    x = "Token", y = "Keyness (G²)"
  )
The dot plot shows that words like party, winston, telescreen, and thought are among the most distinctive terms in Nineteen Eighty-Four — words that encapsulate the novel’s preoccupation with totalitarian control, surveillance, and political conformity.
Bar Plot
Another option is to visualise keyness as a bar plot that simultaneously shows the top keywords for each text. We display the 12 strongest types (keywords of text1) and 12 strongest antitypes (keywords of text2) in a single panel, making the contrasting vocabularies of the two novels immediately apparent.
Code
# get top 12 keywords for text1 (types)
top <- assoc_tb3 %>%
  dplyr::ungroup() %>%
  dplyr::filter(type == "type") %>%
  dplyr::slice_head(n = 12)
# get top 12 keywords for text2 (antitypes of text1)
bot <- assoc_tb3 %>%
  dplyr::ungroup() %>%
  dplyr::filter(type == "antitype") %>%
  dplyr::slice_tail(n = 12)
# combine and plot
rbind(top, bot) %>%
  ggplot(aes(x = reorder(token, G2, mean), y = G2,
             label = round(G2, 1), fill = type)) +
  geom_bar(stat = "identity") +
  geom_text(aes(
    y = ifelse(G2 > 0, G2 - max(abs(G2)) * 0.04, G2 + max(abs(G2)) * 0.04),
    label = round(G2, 1)
  ), color = "white", size = 3) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none",
        panel.grid.minor = element_blank()) +
  scale_fill_manual(values = c("antitype" = "orange", "type" = "steelblue")) +
  labs(
    title = "Top keywords (blue) and antitypes (orange)",
    subtitle = "Target: Orwell's Nineteen Eighty-Four | Reference: Melville's Moby Dick",
    x = "Keyword", y = "Keyness (G²)"
  )
Bars extending to the right (blue) show the strongest keywords of Nineteen Eighty-Four; bars extending to the left (orange) show words characteristic of Moby Dick that are underrepresented in Orwell. The contrast is striking: Melville’s distinctive vocabulary (whale, ship, sea, ahab) reflects the nautical world of the novel, while Orwell’s keywords (party, winston, telescreen) evoke the dystopian political landscape of Nineteen Eighty-Four.
Comparative Word Clouds
Comparison clouds, a variant of word clouds, are helpful in discerning disparities between texts. Compared to the more rigorous methods for identifying keywords introduced above, they rely on a rather basic procedure for determining distinctive words. Nonetheless, comparison clouds are very useful visualisation tools during the initial steps of an analysis.
In a first step, we generate a corpus object from the texts and create a variable with the author name.
Now, we can remove so-called stopwords (non-lexical function words) and punctuation and generate the comparison cloud.
Code
# create a comparison word cloud for a corpus
corp_dom %>%
  # tokenize the corpus, removing punctuation, symbols, and numbers
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  # remove English stopwords
  quanteda::tokens_remove(stopwords("english")) %>%
  # create a Document-Feature Matrix (DFM)
  quanteda::dfm() %>%
  # group the DFM by the 'Author' column
  quanteda::dfm_group(groups = corp_dom$Author) %>%
  # trim the DFM, keeping terms that occur at least 10 times
  quanteda::dfm_trim(min_termfreq = 10, verbose = FALSE) %>%
  # generate a comparison word cloud
  quanteda.textplots::textplot_wordcloud(
    comparison = TRUE,
    color = c("darkgray", "orange"),
    max_words = 150
  )
Interpreting Comparison Word Clouds Cautiously
Comparison word clouds use a simplified keyness algorithm that does not apply multiple testing correction and does not distinguish between statistical significance and visual prominence. They should be used for exploration or illustration rather than as the primary or sole evidence for research claims. Always accompany word clouds with the full statistical keyword table, and report statistics (G², phi, etc.) for any keywords you discuss substantively.
Exercises: Visualising Keywords
Q1. In the bar plot of keywords and antitypes, what does a bar extending to the left (negative G²) represent?
Q2. Why are comparison word clouds considered a less rigorous method of keyword identification than the statistical approach demonstrated earlier?
Reporting Standards
Reporting keyword analyses clearly and completely is as important as conducting them correctly. This section summarises conventions for reporting keyness analyses in corpus linguistics and adjacent fields.
Describe both the target and reference corpora: their source, composition, size in tokens, and any relevant metadata (e.g., time period, genre, sampling frame)
State all preprocessing steps: tokenisation method, case normalisation, stopword removal, lemmatisation
Justify the choice of reference corpus relative to the specific research question
Statistical choices
Name the keyness measure(s) used and cite a methodological reference (e.g., G²: Dunning (1993))
State the significance test used (Fisher’s Exact Test or asymptotic chi-square approximation)
State whether and how you corrected for multiple testing (e.g., Bonferroni correction: αcorrected = .05 / k)
Report any minimum frequency thresholds applied before ranking
Results
Report the keyness statistic (G²), the Bonferroni-corrected significance level, and at least one effect size (phi, Log Odds Ratio, or Rate Ratio) for each keyword discussed in detail
Report both types and antitypes if they are relevant to the research question
Provide a full keyword table in the paper (or as supplementary material if space is constrained)
Interpret keywords substantively — connect them to the theoretical or linguistic claims of the study
Model Reporting Paragraph
To identify the lexical characteristics of Orwell’s Nineteen Eighty-Four relative to Melville’s Moby Dick, a keyword analysis was conducted using the log-likelihood statistic (G²; Dunning (1993)). Fisher’s Exact Test was used to assess statistical significance, with a Bonferroni correction applied to control for multiple comparisons across all word types tested (αcorrected = .05 / k). Only words reaching the corrected threshold of p < .001 are reported. Effect sizes are reported as phi (φ). The strongest keywords of Nineteen Eighty-Four included party (G² = [X], φ = [X], p < .001), winston (G² = [X], φ = [X], p < .001), and telescreen (G² = [X], φ = [X], p < .001), reflecting the novel’s preoccupation with political control and surveillance. Prominent antitypes — words significantly underrepresented in Nineteen Eighty-Four relative to Moby Dick — included whale and ship, consistent with the nautical thematic focus of the reference text.
Quick Reference: Keyness Measures
| Measure | Strengths | Use with caution when |
|---|---|---|
| G² (Log-Likelihood) | Robust for rare words; best general-purpose keyness test; widely used | Large N inflates significance; always pair with an effect size such as phi |
| χ² (Chi-Square) | Widely known; same distributional logic as G² | Expected cell frequencies < 5 (use G² instead) |
| Phi (φ) | Scale-free effect size; comparable across words and studies; not N-inflated | Used alone: it does not test statistical significance |
| MI (Mutual Information) | Highlights highly specific, narrowly targeted words | No frequency filter is applied; strongly favours hapax legomena |
| PMI | Interpretable in information-theoretic terms | No frequency filter is applied; also favours rare words |
| Log Odds Ratio | Symmetric; amenable to CIs; recommended effect size for keyness | Zero cells exist without a Haldane correction (+0.5 offset needed) |
| Rate Ratio | Intuitive; easy to communicate to non-specialist audiences | Base rates in the two corpora differ greatly |
| Rate Difference | Shows absolute magnitude of frequency difference | Comparing across words with very different base frequencies |
| Difference Coefficient | Bounded [−1, +1]; accounts for base rate differences | Both rates are near zero (arithmetic instability) |
| Odds Ratio | Familiar from epidemiology; simple ratio | Asymmetric on raw scale; log transformation preferred |
| DeltaP (ΔP) | Bounded [−1, +1]; grounded in conditional probability | Less commonly reported; reviewers may be unfamiliar with it |
| Signed DKL | Information-theoretic; sensitive to distributional divergence | Implementation details vary across software; document the formula used |
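Several of the rate-based measures in the table can be computed from the same 2x2 counts. The sketch below uses invented placeholder counts; note that the ΔP shown is P(target | word) − P(target | other word), one of several possible directions of the measure.

```r
# Minimal sketch: rate-based comparative measures from one 2x2 table.
# All counts are invented placeholders.
a <- 120; b <- 40; c <- 50000; d <- 70000

rate_t <- a / (a + c)  # normalised rate of the word in the target corpus
rate_r <- b / (b + d)  # normalised rate in the reference corpus

rate_ratio <- rate_t / rate_r                        # ratio of rates
rate_diff  <- rate_t - rate_r                        # absolute difference
diff_coef  <- (rate_t - rate_r) / (rate_t + rate_r)  # bounded [-1, +1]
delta_p    <- a / (a + b) - c / (c + d)              # one direction of DeltaP

c(rate_ratio = rate_ratio, rate_diff = rate_diff,
  diff_coef = diff_coef, delta_p = delta_p)
```

Computing several measures from the same table is cheap, and reporting a ratio alongside a bounded coefficient makes it easier for readers to judge both relative and absolute differences.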
Reporting Checklist
| Reporting item | Required |
|---|---|
| Target corpus described (source, size in tokens, composition) | Yes |
| Reference corpus described and choice justified relative to research question | Yes |
| All preprocessing steps reported (tokenisation, case, stopwords, lemmatisation) | Yes |
| Keyness measure named and a methodological reference cited | Yes |
| Significance test specified (Fisher's Exact Test or chi-square p-value) | Yes |
| Multiple testing correction applied and reported (Bonferroni or FDR) | Yes |
| Minimum frequency threshold stated (if applied before ranking) | Recommended |
| Both types and antitypes considered and discussed where relevant | Recommended |
| Full keyword table provided or referenced as supplementary material | Yes |
| Keywords interpreted substantively in relation to the research question | Yes |
Citation & Session Info
Schweinberger, Martin. 2026. Keyness and Keyword Analysis in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/key/key.html (Version 2026.02.24).
@manual{schweinberger2026key,
author = {Schweinberger, Martin},
title = {Keyness and Keyword Analysis in R},
note = {tutorials/key/key.html},
year = {2026},
organization = {The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2026.02.24}
}
Dunning, Ted. 1993. “Accurate Methods for the Statistics of Surprise and Coincidence.” Computational Linguistics 19 (1): 61–74.
Egbert, Jesse, and Douglas Biber. 2019. “Incorporating Text Dispersion into Keyword Analyses.” Corpora 14 (1): 77–104.
Gries, Stefan Th. 2013. Statistics for Linguistics with R: A Practical Introduction. 2nd ed. Berlin: De Gruyter Mouton.
Hunston, Susan. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Scott, Mike. 1997. “PC Analysis of Key Words — and Key Key Words.” System 25 (2): 233–45.
Sønning, Lukas. 2023. “Keyword Analysis in Corpus Linguistics: Rethinking the Foundations.” Corpora 18 (2): 1–31.
Stubbs, Michael. 2010. “Three Concepts of Keywords.” In Keyness in Texts, edited by Marina Bondi and Mike Scott, 1–42. Amsterdam: John Benjamins.
AI Transparency Statement
This tutorial was developed with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to help revise, expand, and improve the tutorial text; add and structure new sections (learning objectives, prerequisite tutorials, section overview callout boxes, detailed explanations of each keyness measure, the types/antitypes callout, reporting guidelines, and quick-reference tables); write the checkdown multiple-choice exercises with detailed right/wrong feedback; and refine the ggplot2 visualisations. All original code and analysis logic from the draft tutorial have been preserved and integrated. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy, completeness, and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
Footnotes
I am extremely grateful to Joseph Flanagan, who provided very helpful feedback and pointed out errors in previous versions of this tutorial. All remaining errors are, of course, my own.