Finding Words in Text: Concordancing with R
The Alice text must be downloaded once and saved before knitting. Run this code once in your console (not in the tutorial itself):
rawtext <- readLines("https://www.gutenberg.org/files/11/11-0.txt")
dir.create("tutorials/kwics/data", recursive = TRUE, showWarnings = FALSE)
save(rawtext, file = "tutorials/kwics/data/alice.rda")
This creates tutorials/kwics/data/alice.rda, which the tutorial loads at knit time.

Introduction

This tutorial introduces concordancing — one of the most fundamental and powerful methods in corpus linguistics. Concordancing allows researchers to search systematically through large text collections, extracting every occurrence of a word or phrase together with the surrounding context. The resulting display, known as a keyword-in-context (KWIC) display, makes patterns of language use visible that would be impossible to detect through ordinary reading.
The tutorial covers the core concepts of concordancing, a survey of available tools from desktop software to web interfaces to R, and a hands-on practical guide to extracting, filtering, sorting, and analysing concordances using the quanteda package. It includes a section on working with spoken language transcripts and demonstrates how to build a custom concordance function that extends quanteda’s built-in capabilities.
Before working through this tutorial, we recommend familiarity with:
- Getting Started with R — R objects, basic syntax, RStudio orientation
- Loading and Saving Data in R — reading files into R
- String Processing in R — working with text using stringr
- Regular Expressions in R — pattern matching with regex
By the end of this tutorial you will be able to:
- Explain what concordancing is, how the KWIC display works, and why it is central to corpus linguistics
- Navigate the landscape of concordancing tools and choose the right tool for different research tasks
- Load and preprocess text data for concordancing in R
- Extract keyword-in-context concordances using quanteda::kwic()
- Use regular expressions to search for morphological variants and complex patterns
- Filter and sort concordances using dplyr pipelines to reveal collocational patterns
- Work with spoken language transcripts and handle annotation markup
- Build and use a custom concordance function for character-based context windows
- Export concordances in Excel, CSV, and other formats for further analysis
Schweinberger, Martin. 2026. Finding Words in Text: Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/kwics/kwics.html (Version 2026.05.01).
An interactive Binder notebook that lets you upload your own texts and run the concordancing code without installing R is available here:
Click here to open the interactive concordancing notebook.
What Is Concordancing?
What you will learn: What a concordance is and how the KWIC display works; why concordancing is central to corpus linguistics; how concordances bridge quantitative and qualitative analysis; and a survey of application areas from linguistics to lexicography to digital humanities
The KWIC Display
A concordance is a systematic list of every occurrence of a search term in a text or corpus, presented with its surrounding context. The standard presentation format is the keyword-in-context (KWIC) display, in which the search term — the node word — appears aligned in the centre of each line, with a fixed window of context on either side:
...couldn't help thinking there      must  be more to life than being merely...
...you are my density. I             must  go. Please excuse me. I mean, my...
...the situation requires that we    must  work together to achieve our aims...
...but the Emperor has no clothes!   must  speak truth to power and be heard...
This deceptively simple layout makes linguistic patterns visible that are impossible to detect through ordinary reading. By displaying multiple instances simultaneously and aligning the node word vertically, concordances allow researchers to:
- observe how words are actually used rather than how we imagine they are used
- identify collocational patterns — words that systematically appear nearby
- distinguish different meanings or senses of a polysemous word from context
- examine grammatical constructions that a word participates in
- compare register and genre variation across different text types
From Intuition to Evidence
One of the most important contributions of concordancing to linguistics is the systematic correction of native-speaker intuition. Speakers often hold confident but inaccurate beliefs about their own language use. Concordances provide observable, verifiable evidence that can confirm, nuance, or directly contradict such intuitions.
Some well-documented intuition-corpus mismatches include:
- speakers typically believe they use very more often than really, but corpus evidence frequently shows the reverse
- formal writing is assumed to avoid contractions, yet concordances reveal they are common in specific formal genres
- particular collocations assumed to be rare prove pervasive in specific registers once a corpus is examined
This empirical grounding is what distinguishes corpus linguistics from introspection-based approaches and makes concordancing indispensable for rigorous language research.
What Concordances Reveal
Concordancing supports several distinct types of linguistic investigation:
Semantic analysis — concordances allow researchers to identify the different senses of polysemous words, observe how context disambiguates meaning, and study semantic prosody: the tendency of words to collocate preferentially with words that carry a positive or negative evaluative charge. The word cause, for instance, collocates predominantly with negative nouns (cause harm, cause problems, cause damage) even though the word itself is semantically neutral.
Collocational analysis — the study of words that habitually co-occur is one of the core contributions of corpus linguistics to lexicography and language description. Concordances make it possible to build collocational profiles for words and to use association measures (mutual information, log-likelihood) to distinguish significant co-occurrences from accidental ones.
Grammatical investigation — concordances reveal verb complementation patterns (does this verb prefer an infinitival or a gerundive complement?), preposition selection, word order preferences, and evidence for grammaticalization processes.
Discourse and framing analysis — by examining what vocabulary surrounds key terms, researchers can study how concepts are constructed and contested in public discourse, identify ideological positioning through word choice, and track the evolution of discursive strategies over time.
Historical linguistics — applied to diachronic corpora, concordancing can document language change, track the semantic bleaching or expansion of words over centuries, and identify grammaticalization paths.
The Quantitative–Qualitative Bridge
Concordancing is unusual among research methods in that it supports both quantitative and qualitative analysis within a single workflow. The quantitative dimension includes frequency counts (how often does the pattern occur?), collocational strength statistics, and distributional comparisons across sub-corpora. The qualitative dimension involves close reading of individual concordance lines, interpretation of pragmatic effects, and recognition of contextual nuance.
This combination makes concordancing particularly well suited to mixed-methods research — studies that require both the breadth of corpus evidence and the depth of interpretive analysis.
Application Areas
Concordancing serves researchers and practitioners across a wide range of disciplines:
Corpus linguistics — documenting authentic language use at scale, testing theoretical claims against corpus evidence, and building usage-based grammatical descriptions.
Sociolinguistics — comparing language use across social groups (age, gender, region, register), studying style-shifting, and investigating language and identity.
Historical linguistics — tracking semantic change over time, documenting grammaticalization, and studying obsolescence and innovation.
Language teaching (Data-Driven Learning) — rather than presenting rules abstractly, DDL approaches have learners discover patterns through guided concordance analysis, building more robust intuitions about authentic usage.
Literary and stylistic analysis — concordances support authorship attribution, the study of recurring motifs and themes, and analysis of an author’s stylistic evolution across a career.
Translation studies — parallel concordancing of source and target texts helps translators find consistent equivalents, identify translation strategies, and maintain terminology consistency across large projects.
Lexicography — modern dictionaries rely on corpus evidence obtained through concordancing to identify word senses, document collocations, find authentic example sentences, and discover new words and meanings.
Content analysis and digital humanities — tracking how concepts are discussed in media, studying framing and ideology through keyword analysis, and examining the historical evolution of key terms.
Q1. A researcher notices that the word risk appears frequently in a corpus of financial news. She searches for it in a concordance and finds it consistently appears with words like exposure, mitigation, management, and assessment. What type of linguistic phenomenon is she observing, and what does it tell her about the word?
Q2. A student claims that concordancing is just a fancy search function — no different from Ctrl+F in a word processor. What is the most important limitation of Ctrl+F that concordancing overcomes?
Concordancing Tools
What you will learn: The main categories of concordancing tool — desktop software, web-based corpus interfaces, and programming environments; the strengths and appropriate use cases of each; and why R with quanteda is the recommended environment for reproducible research
Desktop Concordancing Software
Desktop concordancing applications provide powerful functionality without requiring programming knowledge and remain the most widely used tools in teaching and exploratory research.
AntConc (laurenceanthony.net) is the most widely used free concordancing tool. It is cross-platform (Windows, Mac, Linux), requires no installation dependencies, and provides an intuitive interface ideal for teaching and exploratory analysis. Beyond concordancing, it offers collocate analysis, cluster/n-gram extraction, keyword comparison across corpora, and a dispersion plot view. Its main limitations are modest statistical capabilities, limited export options, and no integration with other analytical environments. AntConc is the recommended starting point for researchers new to concordancing.
WordSmith Tools (lexically.net) is commercial software that has long been the professional standard in corpus linguistics. It provides a comprehensive suite of interconnected tools including concordancing, keyword analysis, and dispersion plots, with sophisticated built-in statistics and professional-quality visualisations. Its main limitations are cost (though institutional licences are available) and Windows-only operation.
SketchEngine (sketchengine.eu) is a web and desktop tool with access to pre-loaded corpora in over 90 languages, automatic corpus annotation, a distinctive “Word Sketch” display showing grammatical and collocational behaviour at a glance, and collaboration features for team research. It is subscription-based but widely used in professional and institutional contexts.
ParaConc is purpose-built for parallel texts (source and translation), allowing researchers to search both sides simultaneously and identify translation strategies — essential for translation studies research.
Web-Based Corpus Interfaces
Many large, professionally designed reference corpora are accessible through web interfaces, eliminating setup overhead and providing immediate access to billions of words of annotated text.
The BYU/English-Corpora.org family (english-corpora.org) created by Mark Davies includes several industry-standard reference corpora for English:
- COCA (Corpus of Contemporary American English) — over 1 billion words (1990–present), balanced across spoken, fiction, magazine, newspaper, and academic genres, updated annually
- COHA (Corpus of Historical American English) — 400+ million words (1820s–2000s) for diachronic research
- NOW Corpus — continuously updated web corpus for tracking very recent language change
These interfaces offer sophisticated search, metadata filtering (by genre, date, etc.), frequency trend visualisations, and free academic access. Their main limitation is that researchers cannot upload their own texts.
Lextutor (lextutor.ca) provides free web-based concordancing alongside vocabulary profiling tools, well suited to classroom use and quick lookups without any installation.
Why R and quanteda?
Given the range of excellent GUI tools available, why invest in learning R for concordancing? The answer lies in four properties that become increasingly important as research grows in scale and ambition:
Reproducibility is the most compelling reason. With a GUI tool, the workflow must be documented in prose (“I opened AntConc, loaded corpus X, searched for Y…”) and is difficult for others to replicate exactly. With an R script, every step is explicit, executable, and shareable. Scripts can be submitted alongside manuscripts, satisfying open science requirements and enabling exact replication.
Integration — R allows concordancing to be embedded in a seamless analytical pipeline. Text can be scraped from the web, cleaned, concordanced, collocates can be tested for statistical significance, results can be visualised with ggplot2, and statistical models can be fit — all within a single environment, without the error-prone import/export steps required when switching between tools.
Flexibility and customisation — R allows arbitrary filtering logic, novel analysis approaches, and automation of repetitive tasks that would be tedious or impossible in GUI tools. This tutorial’s custom concordance function (Section 10) illustrates how quanteda can be extended for specific research needs.
Scalability — R handles corpora of any size efficiently, supports parallel processing, and can be deployed on cloud computing infrastructure for very large-scale projects.
R’s advantages come with a learning investment. For quick one-off explorations, classroom demonstrations with non-technical audiences, or analyses that are genuinely simple, GUI tools like AntConc or COCA are faster and more practical. Many experienced corpus linguists use both: GUI tools for rapid exploration and R for the final reproducible analysis.
Q3. A postgraduate student is conducting a diachronic study of the word awful across 200 years of American English. She has no programming experience. Which tool is most directly suited to her needs, and why?
Q4. A research team plans to publish a corpus study of hedging in academic writing. They have built their own corpus from PDFs and need to submit their complete workflow alongside the manuscript. Which approach is most appropriate?
Setup
Installing Packages
Code
# Run once — comment out after installation
install.packages(c(
  "quanteda",  # core concordancing and tokenisation
  "dplyr",     # data manipulation
  "stringr",   # string processing
  "writexl",   # Excel export
  "here",      # portable file paths
  "flextable", # formatted tables
  "tidyr",     # data reshaping
  "ggplot2",   # visualisation
  "checkdown"  # interactive exercises
))
Loading Packages
Code
library(quanteda)
library(dplyr)
library(stringr)
library(writexl)
library(here)
library(flextable)
library(tidyr)
library(ggplot2)
library(checkdown)
Always load all packages at the very top of your script, before any analysis code. This makes dependencies immediately visible to anyone reading or reusing your script and avoids the common frustration of hitting a “function not found” error halfway through a long analysis.
Loading and Preparing Text
What you will learn: How to load a pre-saved text file into R; what raw text data looks like and why it requires preprocessing; how to clean text for concordancing using stringr; and why data preparation is an essential — not optional — step in any text analysis
Loading Alice in Wonderland
We use Lewis Carroll’s Alice’s Adventures in Wonderland throughout the practical sections of this tutorial. This classic novel provides rich literary language, memorable characters and constructions, sufficient length for meaningful frequency patterns, and is freely available in the public domain via Project Gutenberg.
The text has been pre-downloaded and saved as an .rda file in the tutorial’s data/ folder so that the tutorial renders without requiring an active internet connection at knit time.
Code
load(here::here("tutorials/kwics/data/alice.rda")) # loads object: rawtext
Because the file was written with save(), it must be read back with load(), which restores the saved object (here, rawtext) into the workspace. Let us inspect the first non-empty lines to understand the structure of the raw file:
Code
rawtext[rawtext != ""] |> head(25)
 [1] "Alice’s Adventures in Wonderland"
[2] "by Lewis Carroll"
[3] "CHAPTER I."
[4] "Down the Rabbit-Hole"
[5] "Alice was beginning to get very tired of sitting by her sister on the"
[6] "bank, and of having nothing to do: once or twice she had peeped into"
[7] "the book her sister was reading, but it had no pictures or"
[8] "conversations in it, “and what is the use of a book,” thought Alice"
[9] "“without pictures or conversations?”"
[10] "So she was considering in her own mind (as well as she could, for the"
[11] "hot day made her feel very sleepy and stupid), whether the pleasure of"
[12] "making a daisy-chain would be worth the trouble of getting up and"
[13] "picking the daisies, when suddenly a White Rabbit with pink eyes ran"
[14] "close by her."
[15] "There was nothing so _very_ remarkable in that; nor did Alice think it"
[16] "so _very_ much out of the way to hear the Rabbit say to itself, “Oh"
[17] "dear! Oh dear! I shall be late!” (when she thought it over afterwards,"
[18] "it occurred to her that she ought to have wondered at this, but at the"
[19] "time it all seemed quite natural); but when the Rabbit actually _took a"
[20] "watch out of its waistcoat-pocket_, and looked at it, and then hurried"
[21] "on, Alice started to her feet, for it flashed across her mind that she"
[22] "had never before seen a rabbit with either a waistcoat-pocket, or a"
[23] "watch to take out of it, and burning with curiosity, she ran across the"
[24] "field after it, and fortunately was just in time to see it pop down a"
[25] "large rabbit-hole under the hedge."
The output reveals several features typical of Project Gutenberg files: a title page, legal boilerplate, a table of contents, and chapter headings, all preceding the actual narrative text. These must be handled before analysis.
Cleaning the Text
In practice, researchers typically spend around 80% of their time cleaning and preparing data and 20% analysing it. This is not wasted effort — it is essential investment. Contaminated or inconsistently formatted data produces unreliable concordance results. Poor preparation leads to missed matches, false matches, and results that cannot be replicated.
We apply three cleaning steps in sequence:
Code
text <- rawtext |>
  # 1. Collapse all lines into one continuous string
  paste0(collapse = " ") |>
  # 2. Normalise whitespace (multiple spaces → single space)
  stringr::str_squish() |>
  # 3. Remove Project Gutenberg header up to and including "CHAPTER I."
  stringr::str_remove(".*CHAPTER I\\.")
Code
substr(text, 1, 600)
[1] " Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran"
What each step does:
paste0(collapse = " ") collapses all the separate line elements of the vector into a single continuous string. This is necessary because kwic() works best on continuous text rather than line-by-line vectors.
stringr::str_squish() removes leading and trailing whitespace and reduces any internal sequence of whitespace characters (spaces, tabs, line breaks) to a single space. This standardises spacing throughout the document.
stringr::str_remove(".*CHAPTER I\\.") removes everything from the beginning of the string up to and including the literal text “CHAPTER I.” The .* matches any run of characters, and because regex quantifiers are greedy it consumes as much as possible, so the match extends to the last occurrence of “CHAPTER I.” — conveniently stripping the table-of-contents entry along with the header. The \\. is an escaped period (. is a special regex character meaning “any character”, so we must escape it to mean a literal full stop).
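To see why greedy matching works in our favour here, consider a toy example (a sketch using base R's sub(), which applies the same greedy regex semantics as stringr::str_remove(); the sample string is hypothetical):

```r
# Toy string in which "CHAPTER I." occurs twice: once in a mock
# table of contents and once as the actual chapter heading
x <- "Contents: CHAPTER I. Down the Rabbit-Hole CHAPTER I. Alice was beginning"

# Greedy .* consumes as much as possible, so everything up to and
# including the LAST "CHAPTER I." is removed -- the table-of-contents
# entry disappears along with the header
sub(".*CHAPTER I\\.", "", x)
# -> " Alice was beginning"
```

If the pattern were lazy (.*?), only the text up to the first occurrence would be removed, leaving the duplicate heading in the text.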
Q5. You load a Project Gutenberg text and collapse it with paste0(collapse = " "). Before running str_squish(), you notice the collapsed string contains stretches like "Chapter I.    Down the Rabbit-Hole" with runs of multiple spaces. What does str_squish() do to this, and why is this normalisation important for concordancing?
Q6. A colleague skips the str_remove(".*CHAPTER I\\.") step and runs the concordance directly on the uncleaned text. What specific problems might this cause?
Creating Concordances with quanteda
What you will learn: How quanteda::kwic() works; the structure of the KWIC output table; how to control the context window size; how to search for single words, phrases, and case-insensitive patterns; and how to interpret and count concordance results
Basic KWIC Extraction
The kwic() function is quanteda’s concordancing engine. It requires a tokens object as its first argument — raw text must be passed through quanteda::tokens() before it can be concordanced.
Code
mykwic <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("Alice")
) |>
  as.data.frame()
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
text1 | 4 | 4 | Down the Rabbit-Hole | Alice | was beginning to get very | Alice |
text1 | 63 | 63 | a book , ” thought | Alice | “ without pictures or conversations | Alice |
text1 | 143 | 143 | in that ; nor did | Alice | think it so _very_ much | Alice |
text1 | 229 | 229 | and then hurried on , | Alice | started to her feet , | Alice |
text1 | 299 | 299 | In another moment down went | Alice | after it , never once | Alice |
text1 | 338 | 338 | down , so suddenly that | Alice | had not a moment to | Alice |
text1 | 521 | 521 | “ Well ! ” thought | Alice | to herself , “ after | Alice |
text1 | 647 | 647 | for , you see , | Alice | had learnt several things of | Alice |
The Concordance Table Structure
The output is a data frame with six key columns:
| Column | Contents |
|---|---|
| docname | Source document name (useful for multi-text corpora) |
| from | Start position (token index) of the match |
| to | End position (token index) of the match |
| pre | Left context — tokens to the left of the keyword |
| keyword | The matched token(s) |
| post | Right context — tokens to the right of the keyword |
The pre and post columns contain the context window. By default, kwic() returns 5 tokens on each side.
Counting Matches
Code
nrow(mykwic)
[1] 386
There are 386 occurrences of “Alice” in the text. We can also see what exact forms were matched:
Code
table(mykwic$keyword)
Alice
386
Adjusting the Context Window
The window argument controls how many tokens appear on each side of the keyword. The default is 5; wider windows provide more context for interpretation, while narrower windows are better for studying immediate collocates.
Code
mykwic_wide <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("Alice"),
  window = 10
) |>
  as.data.frame()
head(mykwic_wide, 4)
  docname from to pre keyword
1 text1 4 4 Down the Rabbit-Hole Alice
2 text1 63 63 what is the use of a book , ” thought Alice
3 text1 143 143 was nothing so _very_ remarkable in that ; nor did Alice
4 text1 229 229 and looked at it , and then hurried on , Alice
post pattern
1 was beginning to get very tired of sitting by her Alice
2 “ without pictures or conversations ? ” So she was Alice
3 think it so _very_ much out of the way to Alice
4 started to her feet , for it flashed across her Alice
Guidelines for window size:
- 3 tokens — immediate collocates; tight focus on the word’s closest neighbours
- 5 tokens (default) — a good general starting point; captures most relevant local context
- 10–15 tokens — sentence-level context; useful for pragmatic and discourse analysis
- 20+ tokens — paragraph-level context; rarely needed and can make concordance lines unwieldy
Phrase Search
quanteda::phrase() tells kwic() to treat a multi-word expression as a single search unit. Without phrase(), each word would be searched independently.
Code
kwic_pooralice <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("poor Alice")
) |>
  as.data.frame()
nrow(kwic_pooralice)
[1] 11
Code
head(kwic_pooralice, 5)
  docname from to pre keyword post
1 text1 1541 1542 go through , ” thought poor Alice , “ it would be
2 text1 2130 2131 ; but , alas for poor Alice ! when she got to
3 text1 2332 2333 use now , ” thought poor Alice , “ to pretend to
4 text1 2887 2888 to the garden door . Poor Alice ! It was as much
5 text1 3604 3605 right words , ” said poor Alice , and her eyes filled
pattern
1 poor Alice
2 poor Alice
3 poor Alice
4 poor Alice
5 poor Alice
Case-Insensitive Search
In quanteda, kwic() matches case-insensitively by default (case_insensitive = TRUE) — which is why the “poor Alice” search above also returned “Poor Alice”. Spelling the argument out documents this choice explicitly; setting case_insensitive = FALSE restricts matches to the exact capitalisation of the pattern:
Code
kwic_ci <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("alice"),
  case_insensitive = TRUE
) |>
  as.data.frame()
# Compare case-sensitive vs case-insensitive counts
cat("Case-sensitive matches:", nrow(mykwic), "\n")
Case-sensitive matches: 386
Code
cat("Case-insensitive matches:", nrow(kwic_ci), "\n")
Case-insensitive matches: 386
The counts happen to be identical here because alice never occurs in lower case in this text; in other corpora the two searches can differ substantially.
Searching Multiple Patterns Simultaneously
Pass a character vector to pattern (with phrase()) to search for several expressions in one call:
Code
alice_variants <- c("poor Alice", "little Alice", "dear Alice")
kwic_variants <- quanteda::kwic(
quanteda::tokens(text),
pattern = quanteda::phrase(alice_variants)
) |>
as.data.frame()
table(kwic_variants$keyword)
little Alice poor Alice Poor Alice
3 10 1
Q7. You run kwic(tokens(text), pattern = phrase("the Hatter"), case_insensitive = FALSE) and get 37 results. You then run kwic(tokens(text), pattern = phrase("the hatter"), case_insensitive = TRUE) and get 42 results. What explains the 5 additional matches in the second run?
Q8. What is the difference between kwic(tokens(text), pattern = "poor Alice") and kwic(tokens(text), pattern = phrase("poor Alice"))?
Exporting Concordances
What you will learn: How to export concordances to Excel, CSV, and R’s native .rds format; how to use here() for portable file paths; and how to create formatted concordance tables for reports and presentations
Exporting to Excel
Excel is the most widely compatible format for sharing concordances with colleagues who do not use R:
Code
# make sure the output folder exists before writing to it
dir.create(here::here("output"), showWarnings = FALSE)
writexl::write_xlsx(mykwic, here::here("output", "alice_concordance.xlsx"))
here() for File Paths
The here package constructs file paths relative to your RStudio project root, making your code portable across operating systems and user accounts:
# Hard-coded path — breaks on any other machine
write_xlsx(data, "C:/Users/Martin/Documents/project/output/file.xlsx")
# here() path — works anywhere the project folder is
write_xlsx(data, here::here("output", "file.xlsx"))
Always use here::here() in scripts you intend to share.
Other Export Formats
Code
# CSV — universal plain-text format, ideal for version control
write.csv(mykwic, here::here("output", "concordance.csv"), row.names = FALSE)
# Tab-separated — handles commas in text better than CSV
write.table(mykwic, here::here("output", "concordance.tsv"),
sep = "\t", row.names = FALSE)
# R native format — preserves all R object attributes, smallest file size
saveRDS(mykwic, here::here("output", "concordance.rds"))
# Reload later
mykwic_reloaded <- readRDS(here::here("output", "concordance.rds"))
| Format | Best for |
|---|---|
| .xlsx | Sharing with non-R users; easy manual annotation |
| .csv | Plain-text exchange; version control (Git-friendly) |
| .tsv | Large texts with commas in content |
| .rds | R-to-R sharing; preserves object class and attributes |
Formatted Tables for Reports
For presentations and papers, use flextable to create publication-ready concordance displays:
Code
mykwic |>
  head(10) |>
  dplyr::select(pre, keyword, post) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .95, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(caption = "Concordance of 'Alice' — first 10 instances.") |>
  flextable::border_outer()
pre | keyword | post |
|---|---|---|
Down the Rabbit-Hole | Alice | was beginning to get very |
a book , ” thought | Alice | “ without pictures or conversations |
in that ; nor did | Alice | think it so _very_ much |
and then hurried on , | Alice | started to her feet , |
In another moment down went | Alice | after it , never once |
down , so suddenly that | Alice | had not a moment to |
“ Well ! ” thought | Alice | to herself , “ after |
for , you see , | Alice | had learnt several things of |
got to ? ” ( | Alice | had no idea what Latitude |
else to do , so | Alice | soon began talking again . |
Regular Expressions for Pattern Matching
What you will learn: What regular expressions are and why they are essential for flexible concordancing; the three main categories of regex operator (frequency quantifiers, character classes, position anchors); how to use valuetype = "regex" in kwic(); and how to apply regex to find morphological word families, words of specific structures, and complex patterns
What Are Regular Expressions?
A regular expression (regex) is a sequence of characters that describes a search pattern. Where a literal search finds only the exact string specified, a regex can match an entire family of strings defined by a structural rule. For concordancing, this means that instead of running separate searches for walk, walks, walked, and walking, a single regex \\bwalk\\w* finds all of them at once.
Regular expressions operate through three main types of operator:
Frequency Quantifiers
These control how many times a unit must appear:
Symbol | Meaning | Example |
|---|---|---|
? | Preceding item is optional (0 or 1 times) | colou?r → colour, color |
* | Preceding item appears 0 or more times | walk\w* → walk, walks, walked, walking |
+ | Preceding item appears 1 or more times | walk\w+ → walks, walked, walking (not bare walk) |
{n} | Preceding item appears exactly n times | \w{5} → any 5-letter word |
{n,} | Preceding item appears at least n times | \w{5,} → words of 5 or more letters |
{n,m} | Preceding item appears between n and m times | \w{4,6} → words of 4, 5 or 6 letters |
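The quantifiers can be tried out directly with base R's grepl(), which reports whether a pattern matches each element of a character vector (a quick sketch with made-up test words):

```r
# ? makes the preceding item optional (0 or 1 occurrences):
# matches both spellings, but not an unrelated word
grepl("colou?r", c("color", "colour", "colander"))
# -> TRUE TRUE FALSE

# + requires at least one word character after the stem
grepl("walk\\w+", c("walk", "walking"))
# -> FALSE TRUE

# {5} anchored with ^ and $ matches exactly five word characters
grepl("^\\w{5}$", c("walks", "walk"))
# -> TRUE FALSE
```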
Character Classes
These represent sets of characters:
Symbol | Meaning |
|---|---|
[ab] | Literal a or b |
[A-Z] | Any uppercase letter A through Z |
[0-9] | Any digit 0 through 9 |
[[:digit:]] | Any digit (equivalent to [0-9]) |
[[:lower:]] | Any lowercase letter |
[[:upper:]] | Any uppercase letter |
[[:alpha:]] | Any letter (upper or lower) |
[[:alnum:]] | Any letter or digit |
[[:punct:]] | Any punctuation character |
. | Any single character (wildcard) |
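These classes can likewise be tested with base R's grepl(). Note that in R a POSIX class is only valid inside an outer bracket pair, so it is written [[:digit:]], not [:digit:] (a sketch with made-up strings):

```r
# POSIX classes sit inside an outer bracket expression in R
grepl("[[:digit:]]", c("Chapter 1", "Chapter One"))
# -> TRUE FALSE

# [A-Z] at the start of the string (^) detects capitalised words
grepl("^[A-Z]", c("Alice", "alice"))
# -> TRUE FALSE

# . matches any single character between "r" and "n"
grepl("r.n", c("run", "ran", "rain"))
# -> TRUE TRUE FALSE
```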
Position Anchors
These constrain where in a string the pattern must match:
Symbol | Meaning | Example |
|---|---|---|
\\b | Word boundary (between a word character and a non-word character) | \\brun\\b matches run but not running or rerun |
\\B | Non-boundary (inside a word, between two word characters) | \\Brun\\B matches the run in rerunning but not standalone run |
^ | Start of the string | ^Alice matches Alice only at the very start of the text |
$ | End of the string | Alice$ matches Alice only at the very end of the text |
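The difference between the anchors is easiest to see on a small vector — a base-R sketch (perl = TRUE enables \b and \B):

```r
words <- c("run", "running", "rerun", "rerunning")
grepl("\\brun\\b", words, perl = TRUE)  # standalone word only: TRUE FALSE FALSE FALSE
grepl("\\Brun\\B", words, perl = TRUE)  # word-internal only:   FALSE FALSE FALSE TRUE
grepl("^run", words)                    # string-initial:       TRUE TRUE FALSE FALSE
```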
Using Regex in kwic()
Set valuetype = "regex" to activate regular expression matching:
Code
# Find all words beginning with "alic" OR "hatt"
kwic_regex <- quanteda::kwic(
quanteda::tokens(text),
pattern = "\\b(alic|hatt)\\w*",
valuetype = "regex"
) |>
as.data.frame()
table(kwic_regex$keyword)
Alice Alice’s hatter Hatter Hatter’s hatters
386 10 1 54 1 1
Pattern breakdown:
- \\b — word boundary: the match must begin at the start of a word
- (alic|hatt) — alternation: either “alic” or “hatt”
- \\w* — zero or more word characters: captures any continuation of the stem
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
text1 | 4 | 4 | Down the Rabbit-Hole | Alice | was beginning to get very | \b(alic|hatt)\w* |
text1 | 63 | 63 | a book , ” thought | Alice | “ without pictures or conversations | \b(alic|hatt)\w* |
text1 | 143 | 143 | in that ; nor did | Alice | think it so _very_ much | \b(alic|hatt)\w* |
text1 | 229 | 229 | and then hurried on , | Alice | started to her feet , | \b(alic|hatt)\w* |
text1 | 299 | 299 | In another moment down went | Alice | after it , never once | \b(alic|hatt)\w* |
text1 | 338 | 338 | down , so suddenly that | Alice | had not a moment to | \b(alic|hatt)\w* |
text1 | 521 | 521 | “ Well ! ” thought | Alice | to herself , “ after | \b(alic|hatt)\w* |
text1 | 647 | 647 | for , you see , | Alice | had learnt several things of | \b(alic|hatt)\w* |
Common Regex Concordance Patterns
Code
# All morphological forms of "think"
kwic(tokens(text), pattern = "\\bthink\\w*", valuetype = "regex")
# Matches: think, thinks, thinking, thinker — but not thought
# (use "\\b(think|thought)\\w*" to include the irregular past form)
# Words ending in "-tion"
kwic(tokens(text), pattern = "\\w+tion\\b", valuetype = "regex")
# Words ending in "-ing"
kwic(tokens(text), pattern = "\\w+ing\\b", valuetype = "regex")
# Exactly four-letter words
kwic(tokens(text), pattern = "\\b\\w{4}\\b", valuetype = "regex")
# Words of ten or more letters
kwic(tokens(text), pattern = "\\b\\w{10,}\\b", valuetype = "regex")
# Words beginning with un- (negative prefix)
kwic(tokens(text), pattern = "\\bun\\w+", valuetype = "regex")
# Words beginning with a vowel
kwic(tokens(text), pattern = "\\b[aeiou]\\w*", valuetype = "regex")

Regular expressions can be deceptively fragile. Small errors produce patterns that match either too much (returning thousands of false positives) or too little (silently missing the intended targets). Best practice is to test your pattern on a small sample using stringr::str_detect() or an online regex tester (e.g., regex101.com) before running it on a full corpus. Always inspect a random sample of the results to verify the pattern is working as intended.
Q9. A researcher wants to find all words in the Alice text that begin with the prefix re- (such as return, repeat, remember). She writes the pattern "re\\w*" with valuetype = "regex". What is the problem with this pattern, and how should it be fixed?
Q10. You want to find all words in the text that end in either -ful or -less (e.g. careful, careless, wonderful, hopeless). Write the regex pattern you would use.
Filtering and Sorting Concordances
What you will learn: How to use dplyr pipelines to filter concordance output by context patterns; how to sort concordances alphabetically and by collocate frequency; how to extract the immediate left and right neighbours of a keyword; and why sorting and filtering are essential for moving from raw concordance output to linguistic insight
Why Filtering and Sorting Matter
A raw concordance of a high-frequency word in a full-length novel may contain hundreds or thousands of lines. The analytical work of concordancing only begins once the raw output has been filtered to relevant instances and sorted in ways that group similar contexts together. Two operations are central to this:
Filtering restricts the concordance to lines meeting a specified context condition — for example, instances of said preceded by a character name, or instances of very followed by an adjective. Filtering uses dplyr::filter() in combination with stringr::str_detect().
Sorting reorders the concordance lines to surface patterns. Alphabetical sorting by the word immediately following the keyword groups lines that share the same collocate. Frequency sorting prioritises the most common collocates, immediately revealing the strongest patterns in the data.
Filtering by Context Pattern
The following pipeline finds instances of alice only when the immediately preceding word is poor or little:
Code
kwic_filtered <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE
) |>
as.data.frame() |>
# Keep only lines where the last word of the left context is "poor" or "little"
dplyr::filter(stringr::str_detect(pre, "(poor|little)$"))
nrow(kwic_filtered)
[1] 13
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
text1 | 1,542 | 1,542 | through , ” thought poor | Alice | , “ it would be | alice |
text1 | 1,725 | 1,725 | ” but the wise little | Alice | was not going to do | alice |
text1 | 2,131 | 2,131 | but , alas for poor | Alice | ! when she got to | alice |
text1 | 2,333 | 2,333 | now , ” thought poor | Alice | , “ to pretend to | alice |
text1 | 3,605 | 3,605 | words , ” said poor | Alice | , and her eyes filled | alice |
text1 | 6,877 | 6,877 | it ! ” pleaded poor | Alice | . “ But you’re so | alice |
text1 | 7,291 | 7,291 | ! ” And here poor | Alice | began to cry again , | alice |
text1 | 8,240 | 8,240 | home , ” thought poor | Alice | , “ when one wasn’t | alice |
text1 | 11,789 | 11,789 | it ! ” pleaded poor | Alice | in a piteous tone . | alice |
text1 | 19,142 | 19,142 | This answer so confused poor | Alice | , that she let the | alice |
Pattern anatomy: str_detect(pre, "(poor|little)$") checks whether the pre column (the left context string) ends with ($) either poor or little. The $ anchor is important: without it, the filter would also match lines where poor or little appears earlier in the left context but is not the immediately preceding word.
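The effect of the anchor can be checked on two invented left-context strings — a minimal base-R sketch (stringr::str_detect() behaves the same way):

```r
pre <- c("alas for poor", "poor thing said the")
grepl("(poor|little)$", pre)  # anchored: only the first ends in "poor" -> TRUE FALSE
grepl("(poor|little)", pre)   # unanchored: both contain "poor"        -> TRUE TRUE
```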
Multiple Conditions
Conditions can be combined with & (AND) or | (OR):
Code
# Find "said" preceded by a character name AND followed by a comma or period
kwic_said <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "said"
) |>
as.data.frame() |>
dplyr::filter(
stringr::str_detect(pre, "(Alice|Hatter|Queen|King|Cat|Mouse)$"),
stringr::str_detect(post, "^[,.]")
)
head(kwic_said, 8)
 docname from to pre keyword
1 text1 15850 15850 direction , ” the Cat said
2 text1 17618 17618 yet ? ” the Hatter said
3 text1 17747 17747 don’t ! ” the Hatter said
4 text1 30643 30643 must , ” the King said
5 text1 31312 31312 important , ” the King said
6 text1 32748 32748 verdict , ” the King said
post pattern
1 , waving its right paw said
2 , turning to Alice again said
3 , tossing his head contemptuously said
4 , with a melancholy air said
5 , turning to the jury said
6 , for about the twentieth said
This pipeline extracts speech acts of the form “[Character name] said, …” — a common construction in narrative fiction.
Alphabetical Sorting
Sorting alphabetically by the right context groups together lines with the same immediately following word, making collocational patterns immediately visible:
Code
kwic_sorted_alpha <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE
) |>
as.data.frame() |>
dplyr::arrange(post)
head(kwic_sorted_alpha, 8)
 docname from to pre keyword post
1 text1 7754 7754 happen : “ ‘ Miss Alice ! Come here directly ,
2 text1 2888 2888 the garden door . Poor Alice ! It was as much
3 text1 2131 2131 but , alas for poor Alice ! when she got to
4 text1 30891 30891 voice , the name “ Alice ! ” CHAPTER XII .
5 text1 8423 8423 “ Oh , you foolish Alice ! ” she answered herself
6 text1 2606 2606 and curiouser ! ” cried Alice ( she was so much
7 text1 25861 25861 I haven’t , ” said Alice ) — “ and perhaps
8 text1 32275 32275 explain it , ” said Alice , ( she had grown
pattern
1 alice
2 alice
3 alice
4 alice
5 alice
6 alice
7 alice
8 alice
Frequency Sorting
Frequency sorting identifies the most common collocates — the words that appear most often immediately before or after the keyword. Note that str_extract(post, "^\\w+") returns NA when the keyword is followed by punctuation rather than a word, so the NA group can dominate the count:
Code
kwic_sorted_freq <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE
) |>
as.data.frame() |>
# Extract the first word of the right context
dplyr::mutate(post1 = stringr::str_extract(post, "^\\w+")) |>
# Count how often each right-context word occurs
dplyr::add_count(post1, name = "post1_freq") |>
# Sort from most to least frequent
dplyr::arrange(dplyr::desc(post1_freq))
head(kwic_sorted_freq |> dplyr::select(pre, keyword, post, post1, post1_freq), 12)
 pre keyword post post1
1 a book , ” thought Alice “ without pictures or conversations <NA>
2 through , ” thought poor Alice , “ it would be <NA>
3 here before , ” said Alice , ) and round the <NA>
4 curious feeling ! ” said Alice ; “ I must be <NA>
5 but , alas for poor Alice ! when she got to <NA>
6 now , ” thought poor Alice , “ to pretend to <NA>
7 eat it , ” said Alice , “ and if it <NA>
8 and curiouser ! ” cried Alice ( she was so much <NA>
9 to them , ” thought Alice , “ or perhaps they <NA>
10 the garden door . Poor Alice ! It was as much <NA>
11 of yourself , ” said Alice , “ a great girl <NA>
12 words , ” said poor Alice , and her eyes filled <NA>
post1_freq
1 163
2 163
3 163
4 163
5 163
6 163
7 163
8 163
9 163
10 163
11 163
12 163
Code
# Summary table: top 10 right-context collocates
kwic_sorted_freq |>
dplyr::distinct(post1, post1_freq) |>
dplyr::arrange(dplyr::desc(post1_freq)) |>
head(10) |>
flextable::flextable() |>
flextable::set_table_properties(width = .4, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::set_caption(caption = "Top 10 words immediately following 'alice'.") |>
flextable::border_outer()
post1 | post1_freq |
|---|---|
NA | 163 |
was | 17 |
thought | 12 |
had | 11 |
said | 11 |
could | 11 |
replied | 9 |
did | 9 |
looked | 8 |
to | 7 |
Extracting N-gram Collocates
It is often useful to extract not just the immediately adjacent word but the first two or three words on each side, building a more complete collocational profile:
Code
kwic_ngram <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE,
window = 5
) |>
as.data.frame() |>
dplyr::rowwise() |>
dplyr::mutate(
post1 = stringr::str_split(post, " ")[[1]][1],
post2 = stringr::str_split(post, " ")[[1]][2],
pre1 = dplyr::last(stringr::str_split(pre, " ")[[1]]),
pre2 = rev(stringr::str_split(pre, " ")[[1]])[2]
) |>
dplyr::ungroup()
# Most common bigrams following "alice"
kwic_ngram |>
dplyr::count(post1, post2, sort = TRUE) |>
head(10) |>
flextable::flextable() |>
flextable::set_table_properties(width = .4, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::set_caption(caption = "Most frequent bigrams immediately following 'alice'.") |>
flextable::border_outer()
post1 | post2 | n |
|---|---|---|
. | “ | 50 |
, | “ | 20 |
; | “ | 13 |
, | and | 9 |
did | not | 9 |
, | as | 6 |
, | who | 6 |
: | “ | 6 |
in | a | 6 |
to | herself | 6 |
Q11. You extract a concordance of the word thought in the Alice text and want to find only instances where the preceding context ends with she (i.e. “she thought”). Which dplyr::filter() call achieves this?
Q12. After sorting a concordance of very by frequency of the immediately following word, you find that much is the most frequent right-context collocate. What does this tell you about the grammatical behaviour of very in the text, and what follow-up analysis might you do?
Working with Spoken Transcripts
What you will learn: How spoken language transcripts differ structurally from written text; how to load and preprocess ICE-Ireland transcript files; how to handle annotation markup in concordancing; and how to extract and analyse discourse markers in spoken data
Characteristics of Spoken Transcripts
Spoken language transcripts differ from written texts in several ways that require adapted preprocessing:
Speaker turn structure — transcripts are organised by speaker turns, often with speaker IDs encoded in annotation tags.
Paralinguistic markers — laughter, coughing, pauses, and overlaps are typically encoded using special markup: <,> for pauses, <&> laughter </&> for paralinguistic events.
Incomplete and non-standard forms — spoken language contains false starts, filled pauses (uh, um), and incomplete utterances that do not correspond to standard written sentences.
Metadata headers — corpus transcripts typically begin with file-level metadata: recording date, speaker demographics, and topic information.
These features mean that the same preprocessing pipeline used for written texts will not work cleanly for transcripts. The approach must be adapted to the specific annotation conventions of the corpus being used.
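If, for a given analysis, the markup is not needed at all, it can be stripped before concordancing. The sketch below assumes ICE-style tags as shown in this section; removing the paralinguistic span first prevents its content (“laughter”) from leaking into the speech text:

```r
line <- "<S1A-001$B> <#> It was good so it was <,> you know <&> laughter </&>"

# 1. Remove paralinguistic spans including their content (lazy match, PCRE)
no_para <- gsub("<&>.*?</&>", "", line, perl = TRUE)
# 2. Remove all remaining angle-bracket tags (headers, speaker IDs, pauses)
no_tags <- gsub("<[^>]+>", "", no_para)
# 3. Normalise whitespace
clean <- trimws(gsub("\\s+", " ", no_tags))
clean
# "It was good so it was you know"
```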
Loading ICE-Ireland Transcripts
We work with a sample of five files from the spoken dialogue section of the International Corpus of English — Irish component:
Code
files <- paste0("tutorials/kwics/data/ICEIrelandSample/S1A-00", 1:5, ".txt")
transcripts <- sapply(files, readLines, USE.NAMES = TRUE)
Code
# Inspect the first 12 lines of the first file
transcripts[[1]][1:12]
 [1] "<S1A-001 Riding>"
[2] ""
[3] "<I>"
[4] "<S1A-001$A> <#> Well how did the riding go tonight"
[5] "<S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&>"
[6] "<S1A-001$A> <#> What did you call your horse"
[7] "<S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh"
[8] "<S1A-001$A> <#> And how did Mabel do"
[9] "<S1A-001$B> <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse refused and it refused three times <#> And then <,> she got it round and she just lined it up straight and she just kicked it and she hit it with the whip <,> and over it went the last time you know <#> And Stephanie told her she was very determined and very well-ridden <&> laughter </&> because it had refused the other times you know <#> But Stephanie wouldn't let her give up on it <#> She made her keep coming back and keep coming back <,> until <,> it jumped it you know <#> It was good"
[10] "<S1A-001$A> <#> Yeah I 'm not so sure her jumping 's improving that much <#> She uh <,> seemed to be holding the reins very tight"
[11] "<S1A-001$B> <#> Yeah she was <#> That 's what Stephanie said <#> <{> <[> She </[> needed to <,> give the horse its head"
[12] "<S1A-001$A> <#> <[> Mm </[> </{>"
The output reveals the ICE annotation conventions:
- <S1A-001 Riding> — file header with ID and title
- <I> — transcript start marker
- <S1A-001$A> — speaker A in file 001
- <#> — speech unit boundary
- <,> — pause
- <&> laughter </&> — paralinguistic event
Collapsing and Cleaning Transcripts
For basic concordancing we collapse each transcript to a single string and normalise whitespace. We retain the annotation tags for now — they can be used later to extract speaker-specific data:
Code
transcripts_collapsed <- sapply(transcripts, function(x) {
x |>
paste0(collapse = " ") |>
stringr::str_squish()
})
# Preview the first 400 characters of the first transcript
substr(transcripts_collapsed[[1]], 1, 400)
[1] "<S1A-001 Riding> <I> <S1A-001$A> <#> Well how did the riding go tonight <S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&> <S1A-001$A> <#> What did you call your horse <S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh <S1A-001$A> <#> And how did Mabel do <S1A-0"
Concordancing Transcripts
We search for the discourse marker you know, a high-frequency pragmatic expression in spoken Irish English used to signal common ground, manage turn-taking, and mark informational status:
Code
kwic_youknow <- quanteda::kwic(
# Use "fasterword" tokeniser to preserve annotation tags as tokens
quanteda::tokens(transcripts_collapsed, what = "fasterword"),
pattern = quanteda::phrase("you know"),
window = 10
) |>
as.data.frame() |>
# Clean up document names to show only the file ID
dplyr::mutate(docname = stringr::str_extract(docname, "S1A-\\d+"))
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
S1A-001 | 42 | 43 | let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter </&> <S1A-001$A> <#> What | you know |
S1A-001 | 140 | 141 | the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determined and | you know |
S1A-001 | 164 | 165 | <&> laughter </&> because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it | you know |
S1A-001 | 193 | 194 | and keep coming back <,> until <,> it jumped it | you know | <#> It was good <S1A-001$A> <#> Yeah I 'm not | you know |
S1A-001 | 402 | 403 | 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> <#> Yeah but I mean | you know |
S1A-001 | 443 | 444 | the best goes top of the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So </[> </{> it 's like | you know |
S1A-001 | 484 | 485 | I 'm not sure now <#> We didn't discuss it | you know | <S1A-001$A> <#> Well it sounds like more money <S1A-001$B> <#> | you know |
S1A-001 | 598 | 599 | on Monday and do without her lesson on Tuesday <,> | you know | <#> But I was keeping her going cos I says | you know |
S1A-001 | 727 | 728 | to take it tomorrow <,> that she could take her | you know | the wee shoulder bag she has <S1A-001$A> <#> Mhm <S1A-001$B> | you know |
S1A-001 | 808 | 809 | <,> and <,> sort of show them around <,> uhm | you know | their timetable and <,> give them their timetable and show | you know |
Why what = "fasterword"?
The default quanteda::tokens() tokeniser strips punctuation and special characters. For annotated transcripts, this would remove the <#>, <,>, and speaker tags — information that may be important for the analysis. what = "fasterword" tokenises by whitespace only, preserving tags as tokens. A wider context window (here, 10) is also appropriate for spoken data because pauses, tags, and fillers occupy tokens within the window.
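The effect of whitespace-only tokenisation can be illustrated with base R: splitting on whitespace keeps every annotation tag intact as its own token, which is essentially what what = "fasterword" does inside quanteda (a minimal sketch using one transcript line from above):

```r
line <- "<S1A-001$A> <#> Well how did the riding go tonight"
strsplit(line, "\\s+")[[1]]
# "<S1A-001$A>" "<#>" "Well" "how" "did" "the" "riding" "go" "tonight"
```

A punctuation-stripping tokeniser would instead break these tags apart or discard them entirely.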
Distribution Across Files
We can compare the frequency of you know across the five files:
Code
kwic_youknow |>
dplyr::count(docname, name = "n_youknow") |>
dplyr::arrange(dplyr::desc(n_youknow)) |>
flextable::flextable() |>
flextable::set_table_properties(width = .35, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::set_caption(caption = "Frequency of 'you know' per file.") |>
flextable::border_outer()
docname | n_youknow |
|---|---|
S1A-001 | 18 |
S1A-005 | 15 |
S1A-002 | 14 |
S1A-004 | 14 |
S1A-003 | 13 |
A Custom Concordance Function
What you will learn: How quanteda::kwic() works internally; how to build an improved custom concordance function with character-based (rather than token-based) context windows, structured output with named columns, and input validation; and when a custom function provides capabilities that kwic() does not
Why Build a Custom Function?
quanteda::kwic() is an excellent, well-tested concordancer for the vast majority of use cases. There are, however, situations where a custom function is useful:
- Character-based context windows — kwic() measures context in tokens. For some analyses (e.g. studies of typographic or layout features, or working with very short texts), a character-count window is more appropriate.
- Fine-grained output control — a custom function can return exactly the columns and naming conventions your workflow requires without post-hoc renaming.
- Educational transparency — writing the function from scratch makes the mechanics of concordancing explicit and demystifies the process.
- Integration with non-standard inputs — for data sources that do not fit naturally into quanteda’s corpus/tokens workflow, a character-level function may be simpler.
The Improved Custom Function
The function below improves on a naive first implementation in several ways:
- Input validation — it checks that the text and pattern are non-empty and the context length is positive, issuing informative error messages if not.
- Vectorised output — it returns a properly typed tibble rather than a matrix-derived data frame.
- Named, clean columns — Left, Node, Right, DocID, and MatchID are returned for clarity.
- Multiple document support — the function accepts a named vector and records the source document for each hit.
- Safe handling of edge positions — matches near the start or end of a text are handled without errors by clamping the window to the text boundaries.
Code
concordance <- function(texts, pattern, context = 80,
ignore_case = FALSE, perl = TRUE) {
# ── Input validation ──────────────────────────────────────────────────────
if (!is.character(texts) || length(texts) == 0)
stop("`texts` must be a non-empty character vector.")
if (!is.character(pattern) || length(pattern) != 1 || nchar(pattern) == 0)
stop("`pattern` must be a single non-empty character string (regex allowed).")
if (!is.numeric(context) || context < 1)
stop("`context` must be a positive integer (number of characters per side).")
context <- as.integer(context)
# ── Ensure texts are named (for DocID column) ─────────────────────────────
if (is.null(names(texts))) names(texts) <- paste0("doc", seq_along(texts))
# ── Process each document ─────────────────────────────────────────────────
results <- purrr::imap(texts, function(txt, doc_id) {
# Find all match positions
m <- gregexpr(pattern, txt,
ignore.case = ignore_case,
perl = perl)[[1]]
# No matches in this document
if (m[1] == -1) return(NULL)
match_starts <- as.integer(m)
match_lengths <- attr(m, "match.length")
match_ends <- match_starts + match_lengths - 1L
txt_len <- nchar(txt)
purrr::pmap_dfr(
list(match_starts, match_ends, seq_along(match_starts)),
function(ms, me, idx) {
# Character-based window, clamped to text boundaries
left_start <- max(1L, ms - context)
right_end <- min(txt_len, me + context)
left <- substr(txt, left_start, ms - 1L)
node <- substr(txt, ms, me)
right <- substr(txt, me + 1L, right_end)
# Trim to nearest word boundary for cleaner display
left <- stringr::str_remove(left, "^\\S*\\s") # remove partial first word
right <- stringr::str_remove(right, "\\s\\S*$") # remove partial last word
tibble::tibble(
DocID = doc_id,
MatchID = idx,
Left = left,
Node = node,
Right = right
)
}
)
})
# ── Combine and return ────────────────────────────────────────────────────
out <- dplyr::bind_rows(results)
if (nrow(out) == 0) {
message("No matches found for pattern: ", pattern)
return(tibble::tibble(DocID = character(),
MatchID = integer(),
Left = character(),
Node = character(),
Right = character()))
}
out
}
Using the Custom Function
Code
# Search for "you know" with 60-character context windows
kwic_custom <- concordance(
texts = transcripts_collapsed,
pattern = "you know",
context = 60
)
nrow(kwic_custom)
[1] 62
Code
head(kwic_custom, 6)
# A tibble: 6 × 5
DocID MatchID Left Node Right
<chr> <int> <chr> <chr> <chr>
1 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 1 "was go… you … " <#…
2 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 2 "hit it… you … " <#…
3 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 3 "<&> la… you … " <#…
4 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 4 "back a… you … " <#…
5 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 5 "I said… you … " an…
6 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 6 "<,> wh… you … " </…
DocID | MatchID | Left | Node | Right |
|---|---|---|---|---|
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 1 | was going to let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter </&> <S1A-001$A> <#> What |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 2 | hit it with the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determined and |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 3 | <&> laughter </&> because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it <#> She |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 4 | back and keep coming back <,> until <,> it jumped it | you know | <#> It was good <S1A-001$A> <#> Yeah I 'm not so sure her |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 5 | I said she 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> <#> Yeah but I mean |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 6 | <,> whoever 's the best goes top of the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So </[> </{> it 's like another |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 7 | I got <#> I 'm not sure now <#> We didn't discuss it | you know | <S1A-001$A> <#> Well it sounds like more money <S1A-001$B> |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 8 | go on Monday and do without her lesson on Tuesday <,> | you know | <#> But I was keeping her going cos I says oh I wouldn't |
Regex Patterns and Case Insensitivity
The function accepts any R regular expression in its pattern argument:
Code
# Find "kind of" and "sort of" (hedges), case-insensitive
kwic_hedges <- concordance(
texts = transcripts_collapsed,
pattern = "\\b(kind|sort) of\\b",
context = 70,
ignore_case = TRUE
)
nrow(kwic_hedges)
[1] 19
Code
head(kwic_hedges, 5)
# A tibble: 5 × 5
DocID MatchID Left Node Right
<chr> <int> <chr> <chr> <chr>
1 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 1 "or any… sort… " sh…
2 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 2 "months… sort… " fl…
3 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 3 "them f… sort… " wa…
4 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 4 "of the… sort… " wa…
5 tutorials/kwics/data/ICEIrelandSample/S1A-002.txt 1 "Mass l… kind… " lo…
Comparing Token-Based and Character-Based Windows
The key practical difference between kwic() and our custom function is the unit of the context window:
Code
# quanteda: 7-token window
kwic_token <- quanteda::kwic(
quanteda::tokens(transcripts_collapsed[[1]]),
pattern = quanteda::phrase("you know"),
window = 7
) |>
as.data.frame() |>
head(3) |>
dplyr::mutate(type = "token-based (7 tokens)")
# Custom: 50-character window
kwic_char <- concordance(
texts = transcripts_collapsed[1],
pattern = "you know",
context = 50
) |>
head(3) |>
dplyr::mutate(type = "char-based (50 chars)")
# Show the difference
cat("=== Token-based (7 tokens) ===\n")
=== Token-based (7 tokens) ===
Code
kwic_token[1, c("pre", "keyword", "post")]
 pre keyword
1 jump that was only the fourth time you know
post
1 It was great laughter What did you
Code
cat("\n=== Character-based (50 chars) ===\n")
=== Character-based (50 chars) ===
Code
kwic_char[1, c("Left", "Node", "Right")]
# A tibble: 1 × 3
Left Node Right
<chr> <chr> <chr>
1 "to let me jump that was only the fourth time " you know " It was great laugh…
With a token-based window, a text full of short function words will show more words on each side than a text with many long technical terms — because a token is a token regardless of length. With a character-based window, the amount of displayed text is constant regardless of token length, which can be preferable when the visual appearance of the concordance matters.
Q15. The custom concordance() function uses gregexpr() to find match positions and then calls substr() to extract context. What would happen if you removed the line that trims to the nearest word boundary, and why was that step added?
Q16. You want to use the custom concordance() function to find all instances of British English -ise spellings (e.g. recognise, organise, realise) in a corpus of newspaper editorials. Write the function call you would use.
Reproducible Workflows
What you will learn: How to organise a concordancing project for reproducibility; essential script documentation conventions; how to parameterise analyses for easy modification; and how to export results in formats that support open science
Project Structure
A well-organised project folder makes it easy to return to an analysis months later, share it with collaborators, or submit it alongside a manuscript:
my-concordance-project/
├── data/
│ ├── raw/ # Original texts — never edit these
│ └── processed/ # Cleaned texts and intermediate objects
├── scripts/
│ ├── 01-load-clean.R # Data loading and preprocessing
│ ├── 02-concordance.R # KWIC extraction
│ └── 03-analysis.R # Filtering, sorting, statistics
├── output/
│ ├── concordances/ # Saved KWIC tables
│ └── figures/ # Plots and visualisations
├── README.md # Project description
└── project.Rproj # RStudio project file
Script Documentation
A reproducible script begins with a header block and uses parameters rather than hard-coded values:
Code
# ============================================================
# Title: Concordance analysis of hedging in Alice in Wonderland
# Author: Martin Schweinberger
# Date: 2026-05-01
# Purpose: Extract and analyse instances of epistemic hedges
# ============================================================
library(quanteda)
library(dplyr)
library(writexl)
library(here)
# ── Parameters (change these, not the code below) ───────────
CONTEXT_WINDOW <- 7 # tokens per side
MIN_FREQ <- 3 # minimum collocate frequency to report
OUTPUT_DIR <- here::here("output", "concordances")
# ── Load and clean text ──────────────────────────────────────
load("tutorials/kwics/data/alice.rda") # loads object: rawtext
text_clean <- paste0(rawtext, collapse = " ") |>
stringr::str_squish() |>
stringr::str_remove(".*CHAPTER I\\.")
# ── Extract concordances ─────────────────────────────────────
kwic_hedge <- quanteda::kwic(
quanteda::tokens(text_clean),
pattern = quanteda::phrase(c("perhaps", "might", "seemed to",
"appeared to", "as if")),
window = CONTEXT_WINDOW
) |>
as.data.frame()
# ── Save results ──────────────────────────────────────────────
dir.create(OUTPUT_DIR, showWarnings = FALSE, recursive = TRUE)
writexl::write_xlsx(
kwic_hedge,
file.path(OUTPUT_DIR, paste0("hedges_kwic_", Sys.Date(), ".xlsx"))
)
# ── Session info for reproducibility ─────────────────────────
sessionInfo()
Key practices demonstrated here:
- Parameterisation — CONTEXT_WINDOW, MIN_FREQ, and OUTPUT_DIR are defined at the top so the entire analysis can be reconfigured by changing three lines.
- Datestamped output — Sys.Date() in the filename means each run creates a new output file, preserving the history of the analysis.
- sessionInfo() at the end — records the exact R and package versions used, essential for replication.
Common Mistakes and How to Avoid Them
1. Forgetting case sensitivity. With case_insensitive = FALSE, kwic(tokens(text), phrase("alice")) returns zero results if the text uses “Alice”. Always check capitalisation conventions in your data, and set case_insensitive = TRUE when appropriate.
2. Using regex without valuetype = "regex". The pattern "walk.*" without valuetype = "regex" is treated as a literal glob pattern, not a regular expression. The result may be zero matches or unexpected results.
3. Skipping preprocessing. Running kwic() on uncleaned text means metadata, headers, and formatting artefacts contaminate the concordance.
4. Not inspecting results. Always examine a random sample of concordance lines to verify the pattern matched what you intended. Use sample_n(20) to spot-check.
5. Confusing phrase() and bare strings. Always use phrase() for multi-word search targets. Without it, quanteda may treat the multi-word string as a glob matching individual tokens.
Summary and Further Reading
This tutorial has provided a comprehensive introduction to concordancing with R, covering the conceptual foundations, the tool landscape, and the full practical workflow from loading text to exporting results.
Section 1 established what concordancing is — the systematic KWIC-display extraction of words in context — and why it is central to corpus linguistics: it grounds linguistic claims in observable, verifiable evidence, overcomes the limitations of native-speaker intuition and working memory, and enables both quantitative and qualitative analysis within a single framework.
Section 2 surveyed the concordancing tool landscape: desktop software (AntConc, WordSmith), web corpus interfaces (Sketch Engine, COCA, COHA, Lextutor), and R with quanteda. The case for R rests on reproducibility, integration with statistics and visualisation, flexibility, and scalability.
Sections 3 and 4 covered data loading and preprocessing, with emphasis on the importance of cleaning (collapsing lines, normalising whitespace, removing headers) before concordancing.
Section 5 introduced quanteda::kwic() in depth: the structure of the KWIC output table, window size control, phrase search with phrase(), case-insensitive search, and multi-pattern search.
Section 6 introduced regular expressions as a tool for flexible concordancing: frequency quantifiers, character classes, position anchors, and common concordancing patterns including morphological word families, suffix-based searches, and word length filters.
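Those regex building blocks can be tried directly in base R, independent of any corpus; the word list below is made up for illustration:

```r
words <- c("walk", "walked", "walking", "walker",
           "happiness", "darkness", "run")

# Morphological word family: stem plus optional suffix, anchored both ends.
grep("^walk(s|ed|ing|er)?$", words, value = TRUE)
# "walk" "walked" "walking" "walker"

# Suffix-based search: position anchor at the word end.
grep("ness$", words, value = TRUE)
# "happiness" "darkness"

# Word-length filter: frequency quantifier over any character.
grep("^.{4,6}$", words, value = TRUE)
# "walk" "walked" "walker"
```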
Section 7 extended the workflow to filtering and sorting: using dplyr pipelines with str_detect() to restrict concordances to specific context conditions, and sorting by alphabetical and frequency criteria to surface collocational patterns.
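The filter-and-sort pattern looks like this on a toy concordance table; the data frame is invented, and dplyr and stringr are assumed to be installed:

```r
library(dplyr)
library(stringr)

# Hypothetical three-line concordance with pre/keyword/post columns.
conc <- data.frame(
  pre     = c("she would", "he simply", "they would"),
  keyword = c("walk", "walk", "walk"),
  post    = c("away quickly", "home slowly", "away at once")
)

# Keep only lines whose right context contains "away", then sort
# alphabetically by left context to surface recurring patterns.
conc |>
  filter(str_detect(post, "away")) |>
  arrange(pre)
```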
Section 8 addressed spoken language transcripts, covering the structural differences between written and transcribed spoken text, how to handle annotation markup, and the discourse marker you know as a worked example.
Section 9 presented an improved custom concordance function using character-based context windows, input validation, and support for named multi-document input — extending kwic() for use cases that token-based windows cannot serve.
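A character-based window, in contrast to kwic()'s token-based one, can be sketched in base R. The function name and arguments below are hypothetical, not the tutorial's actual implementation:

```r
# Toy character-window concordancer: 'context' counts characters, not tokens.
concord_chars <- function(text, target, context = 20) {
  m <- gregexpr(target, text, fixed = TRUE)[[1]]
  if (m[1] == -1) return(data.frame())        # no matches: empty result
  len <- attr(m, "match.length")
  data.frame(
    pre  = substring(text, pmax(1, m - context), m - 1),
    hit  = substring(text, m, m + len - 1),
    post = substring(text, m + len, m + len + context - 1)
  )
}

concord_chars("Alice began to feel very sleepy; Alice sat down.", "Alice", 10)
```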
Section 10 demonstrated reproducible workflow conventions: project folder structure, parameterised scripts, datestamped output, and sessionInfo() documentation.
Further reading: Sinclair (1991) remains the foundational reference for concordance-based language description. McEnery and Hardie (2011) provides a comprehensive methodological overview of corpus linguistics. Stubbs (1996) is essential on collocation and semantic prosody. Anthony (2013) discusses tool choice for corpus work. Brezina (2018) covers the statistical analysis of concordance-derived collocations. For quanteda specifically, the package documentation and tutorials at tutorials.quanteda.io are the primary resource.
Citation & Session Info
Schweinberger, Martin. 2026. Finding Words in Text: Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/kwics/kwics.html (Version 2026.05.01).
@manual{schweinberger2026kwics,
author = {Schweinberger, Martin},
title = {Finding Words in Text: Concordancing with R},
note = {tutorials/kwics/kwics.html},
year = {2026},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2026.05.01}
}
This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially expand and restructure a shorter existing LADAL draft tutorial on concordancing. All content — including all R code — was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.
Code
sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Australia/Brisbane
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ggplot2_4.0.2 checkdown_0.0.13 tidyr_1.3.2 flextable_0.9.11
[5] here_1.0.2 writexl_1.5.1 stringr_1.5.1 dplyr_1.2.0
[9] quanteda_4.2.0
loaded via a namespace (and not attached):
[1] fastmatch_1.1-6 gtable_0.3.6 xfun_0.56
[4] htmlwidgets_1.6.4 lattice_0.22-6 vctrs_0.7.1
[7] tools_4.4.2 generics_0.1.3 tibble_3.2.1
[10] pkgconfig_2.0.3 Matrix_1.7-2 data.table_1.17.0
[13] RColorBrewer_1.1-3 S7_0.2.1 uuid_1.2-1
[16] lifecycle_1.0.5 compiler_4.4.2 farver_2.1.2
[19] textshaping_1.0.0 codetools_0.2-20 litedown_0.9
[22] fontquiver_0.2.1 fontLiberation_0.1.0 htmltools_0.5.9
[25] yaml_2.3.10 pillar_1.10.1 openssl_2.3.2
[28] fontBitstreamVera_0.1.1 commonmark_2.0.0 stopwords_2.3
[31] tidyselect_1.2.1 zip_2.3.2 digest_0.6.39
[34] stringi_1.8.4 purrr_1.0.4 rprojroot_2.1.1
[37] fastmap_1.2.0 grid_4.4.2 cli_3.6.4
[40] magrittr_2.0.3 patchwork_1.3.0 utf8_1.2.4
[43] withr_3.0.2 gdtools_0.5.0 scales_1.4.0
[46] rmarkdown_2.30 officer_0.7.3 askpass_1.2.1
[49] ragg_1.3.3 evaluate_1.0.3 knitr_1.51
[52] markdown_2.0 rlang_1.1.7 Rcpp_1.1.1
[55] glue_1.8.0 xml2_1.3.6 renv_1.1.7
[58] rstudioapi_0.17.1 jsonlite_1.9.0 R6_2.6.1
[61] systemfonts_1.3.1