This tutorial introduces keyness and keyword analysis — a set of corpus-linguistic methods for identifying words that are statistically characteristic of one text or corpus when compared to another. Keywords play a pivotal role in text analysis, serving as distinctive terms that hold particular significance within a given text, context, or collection. These words stand out due to their heightened frequency in a specific text or context, setting them apart from their occurrence in another. In essence, keywords are linguistic markers that encapsulate the essence or topical focus of a document or dataset. The process of identifying keywords is akin to the one employed for detecting collocations using kwics: we compare the use of a particular word in a target corpus A against its use in a reference corpus B. By discerning the frequency disparities, we gain valuable insights into the salient terms that contribute significantly to the unique character and thematic emphasis of a given text or context.
This tutorial is aimed at beginners and intermediate users of R and showcases how to extract keywords from, and analyse keywords in, textual data. It does not provide a fully-fledged analysis but rather demonstrates selected useful methods associated with keyness and keyword analysis.
Prerequisite Tutorials
To be able to follow this tutorial, we suggest you check out and familiarise yourself with the content of the following R Basics tutorials:
Familiarity with basic frequency analysis and with the concept of statistical significance testing will be particularly helpful for understanding the keyness statistics introduced in this tutorial.
Learning Objectives
By the end of this tutorial you will be able to:
Explain what a keyword is and how keyness analysis differs from simple frequency analysis
Describe the dimensions of keyness proposed by Egbert and Biber (2019) and Sønning (2023) — frequency vs. dispersion, and target-intrinsic vs. comparative
Construct the 2×2 contingency table that underlies all keyness statistics
Compute a comprehensive suite of keyness measures in R — G², χ², phi, MI, PMI, Log Odds Ratio, Rate Ratio, Rate Difference, Difference Coefficient, Odds Ratio, DeltaP, and Signed DKL
Apply Fisher’s Exact Test and Bonferroni correction to assess and control for statistical significance
Visualise keyword results using dot plots, bar plots, and comparison word clouds
Interpret types (overrepresented words) and antitypes (underrepresented words) substantively
Report keyword analyses in accordance with best-practice conventions in corpus linguistics
What This Tutorial Covers
Dimensions of keyness — frequency vs. dispersion, discernibility vs. distinctiveness, target-intrinsic vs. comparative approaches
The 2×2 contingency table — the logical and mathematical foundation of keyword identification
Significance testing — Fisher’s Exact Test and Bonferroni correction for multiple comparisons
Visualising keywords — dot plots, bar plots, and comparison word clouds
Reporting standards — what to report, model paragraphs, and a quick-reference checklist
Preparation and Session Set-up
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial we need to install certain packages so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code in this section. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code; installation may take some time (between 1 and 5 minutes for all of the libraries), so do not worry if it takes a while.
Code
# set options
options(stringsAsFactors = F)
options(scipen = 999)
options(max.print = 1000)
# install packages
install.packages("checkdown")
install.packages("flextable")
install.packages("Matrix")
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("tm")
install.packages("ggplot2")
Next, we load the packages.
Code
# load packages
library(checkdown)          # interactive quiz questions
library(flextable)          # formatted tables
library(Matrix)             # sparse matrix support
library(quanteda)           # corpus and tokenisation tools
library(quanteda.textplots) # word clouds and text visualisations
library(dplyr)              # data manipulation
library(stringr)            # string processing
library(tidyr)              # data reshaping
library(tm)                 # stopword lists
library(ggplot2)            # data visualisation
Interactive Keyword Tool
KEYWORD TOOL
Click here to open a notebook-based tool that calculates keyness statistics and allows you to download the results.
How can you detect keywords — words that are characteristic of a text or a collection of texts?
This tutorial aims to show how you can answer this question.
Keywords
Section Overview
What you’ll learn: What keywords are, why they matter, and how keyword identification relates to frequency analysis
Why it matters: Understanding the logic of keyness is essential before computing any statistics — knowing what a keyword is tells you how to choose the right measure and how to interpret the results
Keywords play a central role in corpus linguistics and computational text analysis. In everyday language, the word keyword may mean simply an important or central word in a document. In corpus linguistics, however, the term has a more precise, comparative meaning: a keyword is a word whose frequency — or whose distribution — in a target corpus is statistically unusual compared to a reference corpus (Scott 1997; Stubbs 2010).
This comparative logic is fundamental. Consider the word whale: it will be extremely frequent in a corpus of whaling narratives (such as Melville’s Moby Dick) but far less common in dystopian fiction. Its relative excess in the whaling corpus is what makes it a keyword there — not its raw frequency per se, but its frequency relative to a baseline. The reference corpus serves as that baseline, providing an estimate of how often we would expect a given word to appear in text generally, against which we assess whether its occurrence in the target corpus is surprising.
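This baseline logic amounts to comparing normalised rates rather than raw counts. A quick sketch in R (all counts here are invented for illustration):

```r
# Illustrative (made-up) counts: 'whale' in a whaling corpus vs. a dystopian corpus
whale_target    <- 1200    # occurrences of 'whale' in the target corpus
target_size     <- 210000  # tokens in the target corpus
whale_reference <- 4       # occurrences of 'whale' in the reference corpus
reference_size  <- 100000  # tokens in the reference corpus

# normalised rates per 1,000 words: these, not the raw counts, are comparable
rate_target    <- whale_target / target_size * 1000
rate_reference <- whale_reference / reference_size * 1000
round(c(target = rate_target, reference = rate_reference), 2)
```

Even though the two corpora differ in size, the per-thousand-word rates make the disparity directly comparable; keyness statistics quantify how surprising such a disparity is.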
The identification of keywords is used across a wide range of applications in linguistics and beyond, including:
Stylistic analysis — characterising an author’s distinctive vocabulary relative to contemporaries or a general corpus
Genre analysis — identifying what makes a genre lexically distinctive
Diachronic studies — tracking which words become more or less characteristic of a variety over time
Discourse analysis — revealing vocabulary associated with a particular social group or ideological position
Language pedagogy — identifying vocabulary that is key to a specific academic field or register
The Reference Corpus Matters
The reference corpus is not a neutral backdrop — it shapes every keyword that emerges from the analysis. A study comparing academic writing to news prose will produce very different keywords than one comparing the same academic texts to spoken conversation. Always report what your reference corpus is, justify why it is the appropriate baseline for your research question, and interpret all keywords in light of that choice.
Dimensions of Keyness
Section Overview
What you’ll learn: The theoretical framework for understanding different types of keyness — frequency-based vs. dispersion-based, and target-intrinsic vs. comparative
Key references: Egbert and Biber (2019); Sønning (2023)
Why it matters: Not all keyness measures capture the same property of language. Understanding the dimensions of keyness helps you choose the measure that best reflects your research question.
Before turning to the practicalities of computing keyness, it is worth considering what typicalness — the theoretical goal of keyness analysis — actually means. This question has received renewed attention in recent methodological work (Sønning 2023).
Keyness analysis identifies typical items in a discourse domain, where typicalness traditionally relates to frequency of occurrence: the emphasis is on items used more frequently in the target corpus compared to a reference corpus. Egbert and Biber (2019) expanded this notion by highlighting two distinct criteria for typicalness: content-distinctiveness and content-generalizability.
Content-distinctiveness refers to an item’s association with the domain and its topical relevance — how much more (or less) it is used in the target than in a reference corpus.
Content-generalizability pertains to an item’s widespread usage across various texts within the target domain — whether the word surfaces broadly or is concentrated in just a handful of documents.
These criteria bridge traditional keyness approaches with broader linguistic perspectives, emphasising both the distinctiveness and the generalizability of key items within a corpus.
Following Sønning (2023), we can adopt the keyness criteria of Egbert and Biber (2019) and distinguish between frequency-oriented and dispersion-oriented approaches to assessing keyness. We can also distinguish between keyness features that are assessed relative to the target variety only (target-intrinsic) and those that emerge only from a comparison to a reference variety (comparative). This four-way classification, detailed in the table below, links methodological choices to the linguistic meaning conveyed by quantitative measures:
| Analysis | Frequency-oriented | Dispersion-oriented |
| Target variety in isolation | Discernibility of item in the target variety | Generality across texts in the target variety |
| Comparison to reference variety | Distinctiveness relative to the reference variety | Comparative generality relative to the reference variety |
The second key aspect of keyness involves an item’s dispersion across texts in the target domain, indicating its widespread use. A typical item should appear evenly across various texts within the target domain, reflecting its generality. This breadth of usage can be compared to its occurrence in the reference domain — termed comparative generality. Therefore, a key item should exhibit greater prevalence across target texts compared to those in the reference domain.
In this tutorial we focus primarily on the frequency-comparative quadrant: identifying words that are significantly more (or less) frequent in the target corpus than in the reference corpus. This is by far the most commonly implemented approach in corpus-linguistic research and the one found in tools such as AntConc, WordSmith Tools, and Sketch Engine. Dispersion-based approaches are an important complementary perspective but are beyond the scope of this introductory tutorial.
Exercises: Dimensions of Keyness
Q1. A word appears 800 times in a target corpus of 200,000 tokens, but it also appears very frequently in the reference corpus in proportion to its size. Is this word necessarily a keyword?
Q2. What is the difference between content-distinctiveness and content-generalizability as described by Egbert & Biber (2019)?
Identifying Keywords
Section Overview
What you’ll learn: The logical and mathematical structure of keyword identification — how the 2×2 contingency table works and what information it captures
Why it matters: Every keyness statistic — from G² to MI to the Log Odds Ratio — is computed from this same table. Understanding it is the key to understanding all measures.
Here, we focus on a frequency-based approach that assesses distinctiveness relative to the reference variety. To identify these keywords, we follow the procedure used to identify collocations using kwics — the idea is essentially identical: we compare the use of a word in a target corpus A to its use in a reference corpus B.
To determine if a token is a keyword — whether it occurs significantly more frequently in a target corpus compared to a reference corpus — we use the following information arranged in a 2×2 contingency table:
O11 = Number of times wordx occurs in the target corpus
O12 = Number of times wordx occurs in the reference corpus (without target corpus)
O21 = Number of times other words occur in the target corpus
O22 = Number of times other words occur in the reference corpus
|              | Target corpus | Reference corpus | Row total |
| token        | O11           | O12              | = R1      |
| other tokens | O21           | O22              | = R2      |
| Column total | = C1          | = C2             | = N       |
From these observed counts we compute expected frequencies — the counts we would expect if wordx were distributed in exact proportion to the sizes of the two corpora (i.e., the null hypothesis of no keyness):
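Each expected frequency is the product of the corresponding row and column totals divided by the grand total; for the top-left cell, for example:

\[E_{11} = \frac{R_1 \times C_1}{N}\]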
If the observed O11 substantially exceeds E11, the word appears more often in the target than chance would predict: it is a candidate keyword, also called a type. If O11 is substantially below E11, the word is underrepresented in the target: it is an antitype — a keyword of the reference corpus.
Types and Antitypes
Both directions of keyness are substantively informative:
A type is a word used significantly more in the target corpus than expected — it characterises the target.
An antitype is a word used significantly less in the target corpus than expected — it characterises the reference corpus, or equivalently, is avoided in the target.
Antitypes can reveal what a text or genre systematically avoids saying, which is often as theoretically meaningful as what it uses abundantly. For example, if we compare political speeches to news reporting, words significantly avoided in speeches (antitypes) can illuminate strategic communicative choices.
Data: Two Literary Texts
We begin by loading two texts. text1 is our target and text2 is our reference.
We inspect the first 200 characters of each text to confirm what we are working with:
1984 George Orwell Part 1, Chapter 1 It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, sli
As you can see, text1 is George Orwell’s Nineteen Eighty-Four.
MOBY-DICK; or, THE WHALE. By Herman Melville CHAPTER 1. Loomings. Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interes
The excerpt shows that text2 is Herman Melville’s Moby Dick. These two novels are chosen because they are stylistically and thematically very different — one a mid-twentieth-century dystopian political novel, the other a nineteenth-century nautical adventure — which produces clear and interpretable keywords, making them ideal for illustrative purposes.
Computing Keyness Statistics
Section Overview
What you’ll learn: How to tokenise two texts, build frequency and contingency tables, and calculate a comprehensive suite of keyness measures in R — step by step
Why it matters: Building the analysis from scratch means you understand exactly what each step does and can adapt it to your own corpora and research questions
After loading the two texts, we create a frequency table of the first text (the target).
Code
text1_words <- text1 %>%
  # remove non-word characters
  stringr::str_remove_all("[^[:alpha:] ]") %>%
  # convert to lower case
  tolower() %>%
  # tokenize the corpus files
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  # unlist the tokens to create a data frame
  unlist() %>%
  as.data.frame() %>%
  # rename the column to 'token'
  dplyr::rename(token = 1) %>%
  # group by 'token' and count occurrences
  dplyr::group_by(token) %>%
  dplyr::summarise(n = n()) %>%
  # add column stating where the frequency list is 'from'
  dplyr::mutate(type = "text1")
Now, we create a frequency table for the second text (the reference).
Code
text2_words <- text2 %>%
  # remove non-word characters
  stringr::str_remove_all("[^[:alpha:] ]") %>%
  # convert to lower case
  tolower() %>%
  # tokenize the corpus files
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  # unlist the tokens to create a data frame
  unlist() %>%
  as.data.frame() %>%
  # rename the column to 'token'
  dplyr::rename(token = 1) %>%
  # group by 'token' and count occurrences
  dplyr::group_by(token) %>%
  dplyr::summarise(n = n()) %>%
  # add column stating where the frequency list is 'from'
  dplyr::mutate(type = "text2")
In the next step, we combine the two frequency tables. We use a left join so that every word from the target corpus appears in the combined table, with a zero count assigned to words that do not appear in the reference corpus.
Code
texts_df <- dplyr::left_join(text1_words, text2_words, by = c("token")) %>%
  # rename columns and select relevant columns
  dplyr::rename(
    text1 = n.x,
    text2 = n.y
  ) %>%
  dplyr::select(-type.x, -type.y) %>%
  # replace NA values with 0
  tidyr::replace_na(list(text1 = 0, text2 = 0))
| token       | text1 | text2 |
| a           | 2,390 | 4,536 |
| aaronson    | 8     | 0     |
| aback       | 2     | 2     |
| abandon     | 3     | 3     |
| abandoned   | 4     | 7     |
| abashed     | 1     | 2     |
| abbreviated | 1     | 0     |
| abiding     | 1     | 1     |
| ability     | 1     | 1     |
| abject      | 3     | 0     |
We now calculate the observed and expected frequencies as well as the row and column totals needed to fill the 2×2 contingency table for each word.
The table above shows the keywords for text1, which is George Orwell’s Nineteen Eighty-Four. The table starts with token (word type), followed by type, which indicates whether the token is a keyword in the target data (type) or a keyword in the reference data (antitype). Next is the Bonferroni-corrected significance (Sig_corrected), which accounts for repeated testing. This is followed by O11 (observed frequency of the token in the target corpus), and then by the various keyness statistics, which are explained in detail in the next section.
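As a minimal, self-contained sketch of this calculation step (using a tiny made-up frequency table in place of the texts_df object built above; assoc_df and the toy counts are hypothetical):

```r
library(dplyr)

# toy frequency table standing in for texts_df (counts are invented)
texts_df <- data.frame(
  token = c("whale", "party", "the"),
  text1 = c(1, 80, 500),   # counts in the target corpus
  text2 = c(900, 0, 700)   # counts in the reference corpus
)

C1 <- sum(texts_df$text1)  # target corpus size (column total)
C2 <- sum(texts_df$text2)  # reference corpus size (column total)
N  <- C1 + C2              # grand total

assoc_df <- texts_df %>%
  dplyr::mutate(
    # observed frequencies (2x2 cells for each word)
    O11 = text1,
    O12 = text2,
    O21 = C1 - O11,
    O22 = C2 - O12,
    # expected frequencies under the null hypothesis of no keyness
    E11 = (O11 + O12) * C1 / N,
    E12 = (O11 + O12) * C2 / N,
    E21 = (O21 + O22) * C1 / N,
    E22 = (O21 + O22) * C2 / N,
    # log-likelihood statistic (cells with a zero count contribute 0)
    G2 = 2 * (ifelse(O11 > 0, O11 * log(O11 / E11), 0) +
              ifelse(O12 > 0, O12 * log(O12 / E12), 0) +
              ifelse(O21 > 0, O21 * log(O21 / E21), 0) +
              ifelse(O22 > 0, O22 * log(O22 / E22), 0)),
    # overrepresented in the target = type; underrepresented = antitype
    type = ifelse(O11 > E11, "type", "antitype")
  )
```

In this toy example, party comes out as a type (overused in the target) and whale as an antitype, mirroring the Orwell-vs-Melville contrast.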
Exercises: Computing Keyness
Q1. In the keyword contingency table, what does O11 represent?
Q2. Why is a small offset (e.g., +0.1) added to zero-count cells before calculating keyness statistics?
Q3. What does it mean for a word to be an antitype in a keyword analysis?
Keyness Measures Explained
Section Overview
What you’ll learn: What each keyness statistic measures conceptually, its mathematical formula, and when it is most appropriate to use
Why it matters: Different keyness measures capture different aspects of the relationship between a word and a corpus. Knowing what each one does allows you to make principled choices and report results accurately.
This section explains each of the statistics produced by the code above. Understanding these measures allows you to choose the most appropriate one for your research question and to interpret results correctly. These measures help analyse the association strength and significance of a token’s attraction to the target rather than the reference corpus.
Delta P (ΔP)
Delta P is a measure of association that indicates the difference in conditional probabilities. It measures the strength and direction of the association between a word and corpus membership:
\[\Delta P(A|B) = P(A|B) - P(A|\neg B) \quad \Rightarrow \quad \Delta P = \frac{O_{11}}{R_1} - \frac{O_{21}}{R_2}\]
Where \(P(A|B)\) is the probability of A given B, and \(P(A|\neg B)\) is the probability of A given not-B. Delta P ranges from −1 to +1 and is increasingly recommended in corpus-linguistic work (Gries 2013).
Log Odds Ratio
The Log Odds Ratio measures the strength of association between a word and the target corpus. It is the natural logarithm of the odds ratio and provides a symmetric measure. The +0.5 offsets (Haldane–Anscombe correction) handle zero-count cells:
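With the correction applied to every cell, the measure can be written as:

\[\text{Log Odds Ratio} = \ln\!\left(\frac{(O_{11} + 0.5)\,(O_{22} + 0.5)}{(O_{12} + 0.5)\,(O_{21} + 0.5)}\right)\]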
Positive values indicate overrepresentation in the target; negative values indicate underrepresentation. The Log Odds Ratio is particularly attractive because it is symmetric, interpretable as an effect size, and amenable to confidence interval construction.
Mutual Information (MI)
Mutual Information quantifies the amount of information obtained about corpus membership through knowing the word. It measures mutual dependence between the word and the corpus:
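A common corpus-linguistic implementation computes MI from the observed and expected target frequency (this pointwise form is an assumption about the implementation here, consistent with the low-frequency sensitivity described below):

\[\text{MI} = \log_2\!\left(\frac{O_{11}}{E_{11}}\right)\]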
MI is highly sensitive to low-frequency items: a word appearing only once or twice in the target but never in the reference will receive an extremely high MI score. It therefore tends to favour rare, highly specific words over more general but robustly frequent keywords. Use MI with a minimum frequency filter.
Pointwise Mutual Information (PMI)
Pointwise Mutual Information measures the association between the specific word and the target corpus as point-events:
Like MI, PMI is sensitive to low-frequency words, though in slightly different ways depending on the implementation. Both MI and PMI are better used as ranking or ordering metrics than as standalone significance tests.
Phi (φ) Coefficient
The phi coefficient is a scale-free effect size for the association between a word and corpus membership:
\[\phi = \sqrt{\frac{\chi^2}{N}}\]
Where \(\chi^2\) is Pearson’s chi-square statistic and \(N\) is the total number of tokens across both corpora. Phi ranges from 0 (no association) to 1 (perfect association), and is signed here to indicate direction (positive = type, negative = antitype). Because phi is normalised for sample size, it is valuable for comparing keyness strength across words or studies.
Chi-Square (χ²)
Pearson’s chi-square tests the independence of the word’s distribution from corpus membership:
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
It shares the same distributional logic as G² but is less robust when expected cell frequencies fall below 5 — which is common for rare words in large corpora. For most corpus-linguistic keyness applications, G² is preferred over χ².
Likelihood Ratio (G²)
The log-likelihood ratio statistic (G²) is the most widely recommended keyness measure in corpus linguistics (Dunning 1993). It compares how much better the data fit a model where the word has different rates in the two corpora versus a model assuming a single pooled rate:
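In terms of the observed and expected cell counts (cells with \(O_{ij} = 0\) contribute zero):

\[G^2 = 2 \sum_{ij} O_{ij}\,\ln\!\left(\frac{O_{ij}}{E_{ij}}\right)\]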
G² follows an approximate chi-square distribution, making significance assessment straightforward. Unlike Pearson’s χ², G² performs well even when expected cell frequencies are low, making it more robust for rare words.
Rate Ratio
The Rate Ratio compares the rate of events between two groups — here, the per-thousand-word frequencies in the target and reference corpora:
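Writing \(C_1\) and \(C_2\) for the two corpus sizes, the per-thousand scaling cancels and the measure reduces to a ratio of proportions (in practice a small offset is added to the reference rate to avoid division by zero):

\[\text{Rate Ratio} = \frac{O_{11}/C_1}{O_{12}/C_2}\]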
A Rate Ratio of 3.0 means the word appears three times more frequently per thousand words in the target than in the reference. It is intuitive and easy to communicate to non-specialist audiences. A small offset (+0.001) avoids division by zero for words absent from the reference.
Rate Difference
The Rate Difference measures the absolute difference in per-thousand-word event rates:
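A plausible formulation consistent with the description above, using per-thousand-word rates:

\[\text{Rate Difference} = 1000 \times \left(\frac{O_{11}}{C_1} - \frac{O_{12}}{C_2}\right)\]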
While the Rate Ratio is relative (multiplicative), the Rate Difference is absolute (additive). Both capture useful but distinct aspects of how usage rates differ.
Difference Coefficient
The Difference Coefficient (also known as the Difference Score) normalises the Rate Difference by the sum of the two rates:
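In terms of the per-thousand rates in the target (\(r_T\)) and reference (\(r_R\)):

\[\text{Difference Coefficient} = \frac{r_T - r_R}{r_T + r_R}\]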
It is bounded between −1 and +1: positive values indicate overrepresentation in the target and negative values indicate underrepresentation.
Odds Ratio
The Odds Ratio compares the odds of observing the word in the target corpus to the odds of observing it in the reference corpus. Values above 1 indicate overrepresentation in the target; values below 1 indicate underrepresentation. The log transformation (Log Odds Ratio, above) is usually preferred because it is symmetric around zero.
Log-Likelihood Ratio (LLR)
The LLR as implemented here is a simplified form that focuses on the target word’s contribution to the full G² statistic:
It is signed to indicate direction (positive = more frequent in target; negative = more frequent in reference).
Significance and Multiple Testing
All keyness statistics above measure association strength, but to determine whether a keyword is statistically significant we need a hypothesis test. The code uses Fisher’s Exact Test, which computes the exact probability of observing a contingency table as extreme as the one observed under the null hypothesis of no association. This is more reliable than the asymptotic chi-square approximation, especially for words with small expected counts.
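For a single word, this can be sketched with base R’s fisher.test() (the cell counts below are illustrative, not taken from the tutorial data):

```r
# 2x2 contingency table for one hypothetical word:
#   row 1: O11, O12 (word in target / reference)
#   row 2: O21, O22 (all other words in target / reference)
tab <- matrix(c(80,    5,
                19920, 29995),
              nrow = 2, byrow = TRUE)

ft <- stats::fisher.test(tab)
ft$p.value  # exact two-sided probability under H0 of no association
```

Because the test is exact, it remains valid even when expected cell frequencies fall far below the usual chi-square rules of thumb.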
Bonferroni Correction for Multiple Testing
When testing thousands of words simultaneously, some will appear significant purely by chance. If we test 10,000 words at α = .05, we expect roughly 500 false positives even if no word is truly a keyword. The Bonferroni correction addresses this by dividing the significance threshold by the number of tests performed: αcorrected = α / k, where k is the number of word types tested.
In the output table, the corrected significance tiers are:
| Label       | Meaning                                                               |
| p < .001*** | p ≤ .001 / k — very strong evidence against H₀                        |
| p < .01**   | p ≤ .01 / k                                                           |
| p < .05*    | p ≤ .05 / k                                                           |
| n.s.        | Not significant after Bonferroni correction — excluded from results   |
The Bonferroni correction is conservative (it increases the risk of false negatives alongside reducing false positives). An alternative that controls the False Discovery Rate (FDR) rather than the family-wise error rate is the Benjamini–Hochberg procedure, which offers more statistical power at the cost of allowing a small proportion of false positives among the significant results.
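Both corrections are available in base R via p.adjust(); the p-values below are illustrative:

```r
# five illustrative raw p-values from a keyword analysis
p_raw <- c(0.00001, 0.0004, 0.012, 0.03, 0.2)

# family-wise error rate control (conservative)
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")

# false discovery rate control (more powerful)
p_bh <- p.adjust(p_raw, method = "BH")

sum(p_bonferroni < 0.05)  # keywords surviving Bonferroni
sum(p_bh < 0.05)          # keywords surviving Benjamini-Hochberg
```

With these inputs, Bonferroni retains two of the five candidates while Benjamini–Hochberg retains four, illustrating the difference in statistical power.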
Exercises: Keyness Measures
Q1. Why might Mutual Information (MI) not be the best default measure for identifying keywords in a large corpus?
Q2. G² = 45.3 (p < .001, Bonferroni-corrected). What does this tell us?
Q3. A Rate Ratio of 0.15 for a word in a keyword analysis of text1 vs. text2 means:
Visualising Keywords
Section Overview
What you’ll learn: How to create and interpret three complementary visualisations of keyword results — dot plots, bar plots, and comparison word clouds
Why visualisation matters: A table with thousands of rows of keyness statistics is difficult to scan; visualisations make patterns immediately communicable and allow you to identify the most important results at a glance
Dot Plot
We can visualise the keyness strengths in a dot plot as shown in the code below. Sorting by G² in descending order and selecting the top 20 types gives us the words most strongly characteristic of Orwell’s Nineteen Eighty-Four.
Code
assoc_tb3 %>%
  dplyr::filter(type == "type") %>%
  dplyr::arrange(-G2) %>%
  head(20) %>%
  ggplot(aes(x = reorder(token, G2, mean), y = G2)) +
  geom_point(color = "steelblue", size = 3) +
  geom_segment(aes(xend = token, y = 0, yend = G2),
               color = "steelblue", linewidth = 0.7) +
  coord_flip() +
  theme_bw() +
  theme(panel.grid.minor = element_blank()) +
  labs(
    title = "Top 20 keywords of Orwell's Nineteen Eighty-Four",
    subtitle = "Compared to Melville's Moby Dick | sorted by G² (log-likelihood)",
    x = "Token", y = "Keyness (G²)"
  )
The dot plot shows that words like party, winston, telescreen, and thought are among the most distinctive terms in Nineteen Eighty-Four — words that encapsulate the novel’s preoccupation with totalitarian control, surveillance, and political conformity.
Bar Plot
Another option is to visualise keyness as a bar plot that simultaneously shows the top keywords for each text. We display the 12 strongest types (keywords of text1) and 12 strongest antitypes (keywords of text2) in a single panel, making the contrasting vocabularies of the two novels immediately apparent.
Code
# get top 12 keywords for text1 (types)
top <- assoc_tb3 %>%
  dplyr::ungroup() %>%
  dplyr::filter(type == "type") %>%
  dplyr::slice_head(n = 12)
# get top 12 keywords for text2 (antitypes of text1)
bot <- assoc_tb3 %>%
  dplyr::ungroup() %>%
  dplyr::filter(type == "antitype") %>%
  dplyr::slice_tail(n = 12)
# combine and plot
rbind(top, bot) %>%
  ggplot(aes(x = reorder(token, G2, mean), y = G2,
             label = round(G2, 1), fill = type)) +
  geom_bar(stat = "identity") +
  geom_text(aes(
    y = ifelse(G2 > 0, G2 - max(abs(G2)) * 0.04, G2 + max(abs(G2)) * 0.04),
    label = round(G2, 1)
  ), color = "white", size = 3) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none",
        panel.grid.minor = element_blank()) +
  scale_fill_manual(values = c("antitype" = "orange", "type" = "steelblue")) +
  labs(
    title = "Top keywords (blue) and antitypes (orange)",
    subtitle = "Target: Orwell's Nineteen Eighty-Four | Reference: Melville's Moby Dick",
    x = "Keyword", y = "Keyness (G²)"
  )
Bars extending to the right (blue) show the strongest keywords of Nineteen Eighty-Four; bars extending to the left (orange) show words characteristic of Moby Dick that are underrepresented in Orwell. The contrast is striking: Melville’s distinctive vocabulary (whale, ship, sea, ahab) reflects the nautical world of the novel, while Orwell’s keywords (party, winston, telescreen) evoke the dystopian political landscape of Nineteen Eighty-Four.
Comparative Word Clouds
Comparison clouds, a variant of word clouds, are helpful in discerning disparities between texts. Compared to the more rigorous methods for identifying keywords introduced above, they rely on a rather basic procedure for determining distinctive words. Nonetheless, comparison clouds are very useful visualisation tools during the initial steps of an analysis.
In a first step, we generate a corpus object from the texts and create a variable with the author name.
Now, we can remove so-called stopwords (non-lexical function words) and punctuation and generate the comparison cloud.
Code
# create a comparison word cloud for a corpus
corp_dom %>%
  # tokenize the corpus, removing punctuation, symbols, and numbers
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  # remove English stopwords
  quanteda::tokens_remove(stopwords("english")) %>%
  # create a Document-Feature Matrix (DFM)
  quanteda::dfm() %>%
  # group the DFM by the 'Author' column
  quanteda::dfm_group(groups = corp_dom$Author) %>%
  # trim the DFM, keeping terms that occur at least 10 times
  quanteda::dfm_trim(min_termfreq = 10, verbose = FALSE) %>%
  # generate a comparison word cloud
  quanteda.textplots::textplot_wordcloud(
    comparison = TRUE,
    color = c("darkgray", "orange"),
    max_words = 150
  )
Interpreting Comparison Word Clouds Cautiously
Comparison word clouds use a simplified keyness algorithm that does not apply multiple testing correction and does not distinguish between statistical significance and visual prominence. They should be used for exploration or illustration rather than as the primary or sole evidence for research claims. Always accompany word clouds with the full statistical keyword table, and report statistics (G², phi, etc.) for any keywords you discuss substantively.
Exercises: Visualising Keywords
Q1. In the bar plot of keywords and antitypes, what does a bar extending to the left (negative G²) represent?
Q2. Why are comparison word clouds considered a less rigorous method of keyword identification than the statistical approach demonstrated earlier?
Reporting Standards
Reporting keyword analyses clearly and completely is as important as conducting them correctly. This section summarises conventions for reporting keyness analyses in corpus linguistics and adjacent fields.
Describe both the target and reference corpora: their source, composition, size in tokens, and any relevant metadata (e.g., time period, genre, sampling frame)
State all preprocessing steps: tokenisation method, case normalisation, stopword removal, lemmatisation
Justify the choice of reference corpus relative to the specific research question
Statistical choices
Name the keyness measure(s) used and cite a methodological reference (e.g., G²: Dunning (1993))
State the significance test used (Fisher’s Exact Test or asymptotic chi-square approximation)
State whether and how you corrected for multiple testing (e.g., Bonferroni correction: αcorrected = .05 / k)
Report any minimum frequency thresholds applied before ranking
Results
Report the keyness statistic (G²), the Bonferroni-corrected significance level, and at least one effect size (phi, Log Odds Ratio, or Rate Ratio) for each keyword discussed in detail
Report both types and antitypes if they are relevant to the research question
Provide a full keyword table in the paper (or as supplementary material if space is constrained)
Interpret keywords substantively — connect them to the theoretical or linguistic claims of the study
Model Reporting Paragraph
To identify the lexical characteristics of Orwell’s Nineteen Eighty-Four relative to Melville’s Moby Dick, a keyword analysis was conducted using the log-likelihood statistic (G²; Dunning (1993)). Fisher’s Exact Test was used to assess statistical significance, with a Bonferroni correction applied to control for multiple comparisons across all word types tested (αcorrected = .05 / k). Only words reaching the corrected threshold of p < .001 are reported. Effect sizes are reported as phi (φ). The strongest keywords of Nineteen Eighty-Four included party (G² = [X], φ = [X], p < .001), winston (G² = [X], φ = [X], p < .001), and telescreen (G² = [X], φ = [X], p < .001), reflecting the novel’s preoccupation with political control and surveillance. Prominent antitypes — words significantly underrepresented in Nineteen Eighty-Four relative to Moby Dick — included whale and ship, consistent with the nautical thematic focus of the reference text.
Quick Reference: Keyness Measures
| Measure | Strengths | Use with caution when |
|---|---|---|
| G² (Log-Likelihood) | Robust for rare words; best general-purpose keyness test; widely used | Large N inflates significance; always pair with an effect size such as phi |
| χ² (Chi-Square) | Widely known; same distributional logic as G² | Expected cell frequencies < 5 (use G² instead) |
| Phi (φ) | Scale-free effect size; comparable across words and studies; not N-inflated | Used alone: it does not test statistical significance |
| MI (Mutual Information) | Highlights highly specific, narrowly targeted words | No frequency filter is applied; strongly favours hapax legomena |
| PMI | Interpretable in information-theoretic terms | No frequency filter is applied; also favours rare words |
| Log Odds Ratio | Symmetric; amenable to CIs; recommended effect size for keyness | Zero cells exist without a Haldane correction (+0.5 offset needed) |
| Rate Ratio | Intuitive; easy to communicate to non-specialist audiences | Base rates in the two corpora differ greatly |
| Rate Difference | Shows absolute magnitude of frequency difference | Comparing across words with very different base frequencies |
| Difference Coefficient | Bounded [−1, +1]; accounts for base rate differences | Both rates are near zero (arithmetic instability) |
| Odds Ratio | Familiar from epidemiology; simple ratio | Asymmetric on raw scale; log transformation preferred |
| DeltaP (ΔP) | Bounded [−1, +1]; grounded in conditional probability | Less commonly reported; reviewers may be unfamiliar with it |
| Signed DKL | Information-theoretic; sensitive to distributional divergence | Implementation details vary across software; document the formula used |
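Several of the rate-based measures in the table can be computed from the same 2x2 counts. The sketch below uses invented placeholder counts; note that the ΔP shown is P(target | word) − P(target | other word), one of several possible directions of the measure.

```r
# Minimal sketch: rate-based comparative measures from one 2x2 table.
# All counts are invented placeholders.
a <- 120; b <- 40; c <- 50000; d <- 70000

rate_t <- a / (a + c)  # normalised rate of the word in the target corpus
rate_r <- b / (b + d)  # normalised rate in the reference corpus

rate_ratio <- rate_t / rate_r                        # ratio of rates
rate_diff  <- rate_t - rate_r                        # absolute difference
diff_coef  <- (rate_t - rate_r) / (rate_t + rate_r)  # bounded [-1, +1]
delta_p    <- a / (a + b) - c / (c + d)              # one direction of DeltaP

c(rate_ratio = rate_ratio, rate_diff = rate_diff,
  diff_coef = diff_coef, delta_p = delta_p)
```

Computing several measures from the same table is cheap, and reporting a ratio alongside a bounded coefficient makes it easier for readers to judge both relative and absolute differences.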
Reporting Checklist
| Reporting item | Required |
|---|---|
| Target corpus described (source, size in tokens, composition) | Yes |
| Reference corpus described and choice justified relative to research question | Yes |
| All preprocessing steps reported (tokenisation, case, stopwords, lemmatisation) | Yes |
| Keyness measure named and a methodological reference cited | Yes |
| Significance test specified (Fisher's Exact Test or chi-square p-value) | Yes |
| Multiple testing correction applied and reported (Bonferroni or FDR) | Yes |
| Minimum frequency threshold stated (if applied before ranking) | Recommended |
| Both types and antitypes considered and discussed where relevant | Recommended |
| Full keyword table provided or referenced as supplementary material | Yes |
| Keywords interpreted substantively in relation to the research question | Yes |
Citation & Session Info
Schweinberger, Martin. 2026. Keyness and Keyword Analysis in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/key/key.html (Version 2026.02.24).
@manual{schweinberger2026key,
author = {Schweinberger, Martin},
title = {Keyness and Keyword Analysis in R},
note = {tutorials/key/key.html},
year = {2026},
organization = {The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2026.02.24}
}
Dunning, Ted. 1993. “Accurate Methods for the Statistics of Surprise and Coincidence.” Computational Linguistics 19 (1): 61–74.
Egbert, Jesse, and Douglas Biber. 2019. “Incorporating Text Dispersion into Keyword Analyses.” Corpora 14 (1): 77–104.
Gries, Stefan Th. 2013. Statistics for Linguistics with R: A Practical Introduction. 2nd ed. Berlin: De Gruyter Mouton.
Hunston, Susan. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Scott, Mike. 1997. “PC Analysis of Key Words — and Key Key Words.” System 25 (2): 233–45.
Sønning, Lukas. 2023. “Keyword Analysis in Corpus Linguistics: Rethinking the Foundations.” Corpora 18 (2): 1–31.
Stubbs, Michael. 2010. “Three Concepts of Keywords.” In Keyness in Texts, edited by Marina Bondi and Mike Scott, 1–42. Amsterdam: John Benjamins.
AI Transparency Statement
This tutorial was developed with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to help revise, expand, and improve the tutorial text; add and structure new sections (learning objectives, prerequisite tutorials, section overview callout boxes, detailed explanations of each keyness measure, the types/antitypes callout, reporting guidelines, and quick-reference tables); write the checkdown multiple-choice exercises with detailed right/wrong feedback; and refine the ggplot2 visualisations. All original code and analysis logic from the draft tutorial have been preserved and integrated. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy, completeness, and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.
Footnotes
I am extremely grateful to Joseph Flanagan, who provided very helpful feedback and pointed out errors in previous versions of this tutorial. All remaining errors are, of course, my own.