Finding Words in Text: Concordancing with R

Author

Martin Schweinberger

Preparing the Tutorial Data File

The Alice text must be downloaded once and saved before knitting. Run this code once in your console (not in the tutorial itself):

rawtext <- readLines("https://www.gutenberg.org/files/11/11-0.txt")
dir.create("tutorials/kwics/data", recursive = TRUE, showWarnings = FALSE)
save(rawtext, file = "tutorials/kwics/data/alice.rda")

This creates tutorials/kwics/data/alice.rda, which the tutorial loads at knit time.

Introduction

This tutorial introduces concordancing — one of the most fundamental and powerful methods in corpus linguistics. Concordancing allows researchers to search systematically through large text collections, extracting every occurrence of a word or phrase together with the surrounding context. The resulting display, known as a keyword-in-context (KWIC) display, makes patterns of language use visible that would be impossible to detect through ordinary reading.

The tutorial covers the core concepts of concordancing, a survey of available tools from desktop software to web interfaces to R, and a hands-on practical guide to extracting, filtering, sorting, and analysing concordances using the quanteda package. It includes a section on working with spoken language transcripts and demonstrates how to build a custom concordance function that extends quanteda’s built-in capabilities.

Prerequisite Tutorials

Before working through this tutorial, we recommend familiarity with:

Learning Objectives

By the end of this tutorial you will be able to:

  1. Explain what concordancing is, how the KWIC display works, and why it is central to corpus linguistics
  2. Navigate the landscape of concordancing tools and choose the right tool for different research tasks
  3. Load and preprocess text data for concordancing in R
  4. Extract keyword-in-context concordances using quanteda::kwic()
  5. Use regular expressions to search for morphological variants and complex patterns
  6. Filter and sort concordances using dplyr pipelines to reveal collocational patterns
  7. Work with spoken language transcripts and handle annotation markup
  8. Build and use a custom concordance function for character-based context windows
  9. Export concordances in Excel, CSV, and other formats for further analysis

Citation

Schweinberger, Martin. 2026. Finding Words in Text: Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/kwics/kwics.html (Version 2026.05.01).

LADAL Notebook Tool

An interactive Binder notebook that lets you upload your own texts and run the concordancing code without installing R is available here:

Click here to open the interactive concordancing notebook.


What Is Concordancing?

Section Overview

What you will learn: What a concordance is and how the KWIC display works; why concordancing is central to corpus linguistics; how concordances bridge quantitative and qualitative analysis; and a survey of application areas from linguistics to lexicography to digital humanities

The KWIC Display

A concordance is a systematic list of every occurrence of a search term in a text or corpus, presented with its surrounding context. The standard presentation format is the keyword-in-context (KWIC) display, in which the search term — the node word — appears aligned in the centre of each line, with a fixed window of context on either side:

...couldn't help thinking there  must   be more to life than being merely
...you are my density. I          must   go. Please excuse me. I mean, my
...the situation requires that we  must   work together to achieve our aims.
...if we are to succeed, then we   must   act now, before the opportunity
...the Emperor has no clothes! We  must   speak truth to power and be heard

This deceptively simple layout makes linguistic patterns visible that are impossible to detect through ordinary reading. By displaying multiple instances simultaneously and aligning the node word vertically, concordances allow researchers to:

  • observe how words are actually used rather than how we imagine they are used
  • identify collocational patterns — words that systematically appear nearby
  • distinguish different meanings or senses of a polysemous word from context
  • examine grammatical constructions that a word participates in
  • compare register and genre variation across different text types

From Intuition to Evidence

One of the most important contributions of concordancing to linguistics is the systematic correction of native-speaker intuition. Speakers often hold confident but inaccurate beliefs about their own language use. Concordances provide observable, verifiable evidence that can confirm, nuance, or directly contradict such intuitions.

Some well-documented intuition-corpus mismatches include:

  • speakers typically believe they use very more than really, but corpus evidence frequently shows the reverse
  • formal writing is assumed to avoid contractions, yet concordances reveal they are common in specific formal genres
  • particular collocations assumed to be rare prove pervasive in specific registers once a corpus is examined

This empirical grounding is what distinguishes corpus linguistics from introspection-based approaches and makes concordancing indispensable for rigorous language research.

What Concordances Reveal

Concordancing supports several distinct types of linguistic investigation:

Semantic analysis — concordances allow researchers to identify the different senses of polysemous words, observe how context disambiguates meaning, and study semantic prosody: the tendency of words to collocate preferentially with words that carry a positive or negative evaluative charge. The word cause, for instance, collocates predominantly with negative nouns (cause harm, cause problems, cause damage) even though the word itself is semantically neutral.

Collocational analysis — the study of words that habitually co-occur is one of the core contributions of corpus linguistics to lexicography and language description. Concordances make it possible to build collocational profiles for words and to use association measures (mutual information, log-likelihood) to distinguish significant co-occurrences from accidental ones.

Grammatical investigation — concordances reveal verb complementation patterns (does this verb prefer an infinitival or a gerundive complement?), preposition selection, word order preferences, and evidence for grammaticalization processes.

Discourse and framing analysis — by examining what vocabulary surrounds key terms, researchers can study how concepts are constructed and contested in public discourse, identify ideological positioning through word choice, and track the evolution of discursive strategies over time.

Historical linguistics — applied to diachronic corpora, concordancing can document language change, track the semantic bleaching or expansion of words over centuries, and identify grammaticalization paths.

The Quantitative–Qualitative Bridge

Concordancing is unusual among research methods in that it supports both quantitative and qualitative analysis within a single workflow. The quantitative dimension includes frequency counts (how often does the pattern occur?), collocational strength statistics, and distributional comparisons across sub-corpora. The qualitative dimension involves close reading of individual concordance lines, interpretation of pragmatic effects, and recognition of contextual nuance.

This combination makes concordancing particularly well suited to mixed-methods research — studies that require both the breadth of corpus evidence and the depth of interpretive analysis.

Application Areas

Concordancing serves researchers and practitioners across a wide range of disciplines:

Corpus linguistics — documenting authentic language use at scale, testing theoretical claims against corpus evidence, and building usage-based grammatical descriptions.

Sociolinguistics — comparing language use across social groups (age, gender, region, register), studying style-shifting, and investigating language and identity.

Historical linguistics — tracking semantic change over time, documenting grammaticalization, and studying obsolescence and innovation.

Language teaching (Data-Driven Learning) — rather than presenting rules abstractly, DDL approaches have learners discover patterns through guided concordance analysis, building more robust intuitions about authentic usage.

Literary and stylistic analysis — concordances support authorship attribution, the study of recurring motifs and themes, and analysis of an author’s stylistic evolution across a career.

Translation studies — parallel concordancing of source and target texts helps translators find consistent equivalents, identify translation strategies, and maintain terminology consistency across large projects.

Lexicography — modern dictionaries rely on corpus evidence obtained through concordancing to identify word senses, document collocations, find authentic example sentences, and discover new words and meanings.

Content analysis and digital humanities — tracking how concepts are discussed in media, studying framing and ideology through keyword analysis, and examining the historical evolution of key terms.

Exercises: What Is Concordancing?

Q1. A researcher notices that the word risk appears frequently in a corpus of financial news. She searches for it in a concordance and finds it consistently appears with words like exposure, mitigation, management, and assessment. What type of linguistic phenomenon is she observing, and what does it tell her about the word?






Q2. A student claims that concordancing is just a fancy search function — no different from Ctrl+F in a word processor. What is the most important limitation of Ctrl+F that concordancing overcomes?






Concordancing Tools

Section Overview

What you will learn: The main categories of concordancing tool — desktop software, web-based corpus interfaces, and programming environments; the strengths and appropriate use cases of each; and why R with quanteda is the recommended environment for reproducible research

Desktop Concordancing Software

Desktop concordancing applications provide powerful functionality without requiring programming knowledge and remain the most widely used tools in teaching and exploratory research.

AntConc (laurenceanthony.net) is the most widely used free concordancing tool. It is cross-platform (Windows, Mac, Linux), requires no installation dependencies, and provides an intuitive interface ideal for teaching and exploratory analysis. Beyond concordancing, it offers collocate analysis, cluster/n-gram extraction, keyword comparison across corpora, and a dispersion plot view. Its main limitations are modest statistical capabilities, limited export options, and no integration with other analytical environments. AntConc is the recommended starting point for researchers new to concordancing.

WordSmith Tools (lexically.net) is commercial software that has long been the professional standard in corpus linguistics. It provides a comprehensive suite of interconnected tools including concordancing, keyword analysis, and dispersion plots, with sophisticated built-in statistics and professional-quality visualisations. Its main limitations are cost (though institutional licences are available) and Windows-only operation.

SketchEngine (sketchengine.eu) is a web and desktop tool with access to pre-loaded corpora in over 90 languages, automatic corpus annotation, a distinctive “Word Sketch” display showing grammatical and collocational behaviour at a glance, and collaboration features for team research. It is subscription-based but widely used in professional and institutional contexts.

ParaConc is purpose-built for parallel texts (source and translation), allowing researchers to search both sides simultaneously and identify translation strategies — essential for translation studies research.

Web-Based Corpus Interfaces

Many large, professionally designed reference corpora are accessible through web interfaces, eliminating setup overhead and providing immediate access to billions of words of annotated text.

The BYU/English-Corpora.org family (english-corpora.org) created by Mark Davies includes several industry-standard reference corpora for English:

  • COCA (Corpus of Contemporary American English) — over 1 billion words (1990–present), balanced across spoken, fiction, magazine, newspaper, and academic genres, updated annually
  • COHA (Corpus of Historical American English) — 400+ million words (1820s–2000s) for diachronic research
  • NOW Corpus — continuously updated web corpus for tracking very recent language change

These interfaces offer sophisticated search, metadata filtering (by genre, date, etc.), frequency trend visualisations, and free academic access. Their main limitation is that researchers cannot upload their own texts.

Lextutor (lextutor.ca) provides free web-based concordancing alongside vocabulary profiling tools, well suited to classroom use and quick lookups without any installation.

Why R and quanteda?

Given the range of excellent GUI tools available, why invest in learning R for concordancing? The answer lies in four properties that become increasingly important as research grows in scale and ambition:

Reproducibility is the most compelling reason. With a GUI tool, the workflow must be documented in prose (“I opened AntConc, loaded corpus X, searched for Y…”) and is difficult for others to replicate exactly. With an R script, every step is explicit, executable, and shareable. Scripts can be submitted alongside manuscripts, satisfying open science requirements and enabling exact replication.

Integration — R allows concordancing to be embedded in a seamless analytical pipeline. Text can be scraped from the web, cleaned, concordanced, collocates can be tested for statistical significance, results can be visualised with ggplot2, and statistical models can be fit — all within a single environment, without the error-prone import/export steps required when switching between tools.

Flexibility and customisation — R allows arbitrary filtering logic, novel analysis approaches, and automation of repetitive tasks that would be tedious or impossible in GUI tools. This tutorial’s custom concordance function (Section 10) illustrates how quanteda can be extended for specific research needs.

Scalability — R handles corpora of any size efficiently, supports parallel processing, and can be deployed on cloud computing infrastructure for very large-scale projects.

When GUI Tools Are Better

R’s advantages come with a learning investment. For quick one-off explorations, classroom demonstrations with non-technical audiences, or analyses that are genuinely simple, GUI tools like AntConc or COCA are faster and more practical. Many experienced corpus linguists use both: GUI tools for rapid exploration and R for the final reproducible analysis.

Exercises: Concordancing Tools

Q3. A postgraduate student is conducting a diachronic study of the word awful across 200 years of American English. She has no programming experience. Which tool is most directly suited to her needs, and why?






Q4. A research team plans to publish a corpus study of hedging in academic writing. They have built their own corpus from PDFs and need to submit their complete workflow alongside the manuscript. Which approach is most appropriate?






Setup

Installing Packages

Code
# Run once — comment out after installation
install.packages(c(
  "quanteda",    # core concordancing and tokenisation
  "dplyr",       # data manipulation
  "stringr",     # string processing
  "writexl",     # Excel export
  "here",        # portable file paths
  "flextable",   # formatted tables
  "tidyr",       # data reshaping
  "ggplot2",     # visualisation
  "checkdown"    # interactive exercises
))

Loading Packages

Code
library(quanteda)
library(dplyr)
library(stringr)
library(writexl)
library(here)
library(flextable)
library(tidyr)
library(ggplot2)
library(checkdown)
Best Practice: Load Packages at the Top

Always load all packages at the very top of your script, before any analysis code. This makes dependencies immediately visible to anyone reading or reusing your script and avoids the common frustration of hitting a “function not found” error halfway through a long analysis.


Loading and Preparing Text

Section Overview

What you will learn: How to load a pre-saved text file into R; what raw text data looks like and why it requires preprocessing; how to clean text for concordancing using stringr; and why data preparation is an essential — not optional — step in any text analysis

Loading Alice in Wonderland

We use Lewis Carroll’s Alice’s Adventures in Wonderland throughout the practical sections of this tutorial. This classic novel provides rich literary language, memorable characters and constructions, sufficient length for meaningful frequency patterns, and is freely available in the public domain via Project Gutenberg.

The text has been pre-downloaded and saved as an .rda file in the tutorial’s data/ folder so that the tutorial renders without requiring an active internet connection at knit time.

Code
load(here::here("tutorials/kwics/data/alice.rda"))   # loads object: rawtext

Let us inspect the first non-empty lines to understand the structure of the raw file:

Code
rawtext[rawtext != ""] |> head(25)
 [1] "Alice’s Adventures in Wonderland"                                       
 [2] "by Lewis Carroll"                                                       
 [3] "CHAPTER I."                                                             
 [4] "Down the Rabbit-Hole"                                                   
 [5] "Alice was beginning to get very tired of sitting by her sister on the"  
 [6] "bank, and of having nothing to do: once or twice she had peeped into"   
 [7] "the book her sister was reading, but it had no pictures or"             
 [8] "conversations in it, “and what is the use of a book,” thought Alice"    
 [9] "“without pictures or conversations?”"                                   
[10] "So she was considering in her own mind (as well as she could, for the"  
[11] "hot day made her feel very sleepy and stupid), whether the pleasure of" 
[12] "making a daisy-chain would be worth the trouble of getting up and"      
[13] "picking the daisies, when suddenly a White Rabbit with pink eyes ran"   
[14] "close by her."                                                          
[15] "There was nothing so _very_ remarkable in that; nor did Alice think it" 
[16] "so _very_ much out of the way to hear the Rabbit say to itself, “Oh"    
[17] "dear! Oh dear! I shall be late!” (when she thought it over afterwards," 
[18] "it occurred to her that she ought to have wondered at this, but at the" 
[19] "time it all seemed quite natural); but when the Rabbit actually _took a"
[20] "watch out of its waistcoat-pocket_, and looked at it, and then hurried" 
[21] "on, Alice started to her feet, for it flashed across her mind that she" 
[22] "had never before seen a rabbit with either a waistcoat-pocket, or a"    
[23] "watch to take out of it, and burning with curiosity, she ran across the"
[24] "field after it, and fortunately was just in time to see it pop down a"  
[25] "large rabbit-hole under the hedge."                                     

The output shows features typical of Project Gutenberg files: the title, the author line, and chapter headings precede the actual narrative text (the full raw download also contains legal boilerplate and, in many books, a table of contents). These must be handled before analysis.

Cleaning the Text

The 80/20 Rule of Text Analysis

In practice, researchers typically spend around 80% of their time cleaning and preparing data and 20% analysing it. This is not wasted effort — it is essential investment. Contaminated or inconsistently formatted data produces unreliable concordance results. Poor preparation leads to missed matches, false matches, and results that cannot be replicated.

We apply three cleaning steps in sequence:

Code
text <- rawtext |>
  # 1. Collapse all lines into one continuous string
  paste0(collapse = " ") |>
  # 2. Normalise whitespace (multiple spaces → single space)
  stringr::str_squish() |>
  # 3. Remove Project Gutenberg header up to and including "CHAPTER I."
  stringr::str_remove(".*CHAPTER I\\.")
Code
substr(text, 1, 600)
[1] " Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran"

What each step does:

paste0(collapse = " ") collapses all the separate line elements of the vector into a single continuous string. This is necessary because kwic() works best on continuous text rather than line-by-line vectors.

stringr::str_squish() removes leading and trailing whitespace and reduces any internal sequence of whitespace characters (spaces, tabs, line breaks) to a single space. This standardises spacing throughout the document.

stringr::str_remove(".*CHAPTER I\\.") removes everything from the beginning of the string up to and including the literal text “CHAPTER I.” Because the text has already been collapsed into a single line, there are no newlines to worry about; the greedy .* matches any preceding characters (up to the last occurrence of the pattern), and \\. is an escaped period (. is a special regex character meaning “any character”, so we must escape it to mean a literal full stop).
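Before applying a removal pattern like this to a whole novel, it is worth trying it on a miniature example first. A minimal sketch (the demo string is invented for illustration):

Code
# Hedged sketch: try the removal pattern on a tiny invented example first
demo <- "Title page ... CONTENTS CHAPTER I. Down the Rabbit-Hole Alice was beginning"
stringr::str_remove(demo, ".*CHAPTER I\\.")
# [1] " Down the Rabbit-Hole Alice was beginning"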

Exercises: Loading and Preparing Text

Q5. You load a Project Gutenberg text and collapse it with paste0(collapse = " "). Before running str_squish(), you notice the collapsed string contains stretches like "Chapter I Down the Rabbit-Hole". What does str_squish() do to this, and why is this normalisation important for concordancing?






Q6. A colleague skips the str_remove(".*CHAPTER I\\.") step and runs the concordance directly on the uncleaned text. What specific problems might this cause?






Creating Concordances with quanteda

Section Overview

What you will learn: How quanteda::kwic() works; the structure of the KWIC output table; how to control the context window size; how to search for single words, phrases, and case-insensitive patterns; and how to interpret and count concordance results

Basic KWIC Extraction

The kwic() function is quanteda’s concordancing engine. It requires a tokens object as its first argument — raw text must be passed through quanteda::tokens() before it can be concordanced.

Code
mykwic <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("Alice")
) |>
  as.data.frame()

The first rows of the resulting concordance:

docname  from   to                           pre  keyword                                 post  pattern
text1       4    4         Down the Rabbit-Hole   Alice    was beginning to get very             Alice
text1      63   63           a book , ” thought   Alice    “ without pictures or conversations   Alice
text1     143  143            in that ; nor did   Alice    think it so _very_ much               Alice
text1     229  229        and then hurried on ,   Alice    started to her feet ,                 Alice
text1     299  299   In another moment down went  Alice    after it , never once                 Alice
text1     338  338      down , so suddenly that   Alice    had not a moment to                   Alice
text1     521  521           “ Well ! ” thought   Alice    to herself , “ after                  Alice
text1     647  647             for , you see ,    Alice    had learnt several things of          Alice

The Concordance Table Structure

The output is a data frame with six key columns:

Column Contents
docname Source document name (useful for multi-text corpora)
from Start position (token index) of the match
to End position (token index) of the match
pre Left context — tokens to the left of the keyword
keyword The matched token(s)
post Right context — tokens to the right of the keyword

The pre and post columns contain the context window. By default, kwic() returns 5 tokens on each side.
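Because pre, keyword, and post are ordinary character columns, full concordance lines can be reassembled for close reading. A minimal sketch (the helper column name line is invented):

Code
# Hedged sketch: paste the three context columns back into readable lines
mykwic |>
  dplyr::mutate(line = paste(pre, keyword, post)) |>
  dplyr::pull(line) |>
  head(3)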

Counting Matches

Code
nrow(mykwic)
[1] 386

There are 386 occurrences of “Alice” in the text. We can also see what exact forms were matched:

Code
table(mykwic$keyword)

Alice 
  386 

Adjusting the Context Window

The window argument controls how many tokens appear on each side of the keyword. The default is 5; wider windows provide more context for interpretation, while narrower windows are better for studying immediate collocates.

Code
mykwic_wide <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("Alice"),
  window  = 10
) |>
  as.data.frame()

head(mykwic_wide, 4)
  docname from  to                                                pre keyword
1   text1    4   4                               Down the Rabbit-Hole   Alice
2   text1   63  63              what is the use of a book , ” thought   Alice
3   text1  143 143 was nothing so _very_ remarkable in that ; nor did   Alice
4   text1  229 229           and looked at it , and then hurried on ,   Alice
                                                post pattern
1  was beginning to get very tired of sitting by her   Alice
2 “ without pictures or conversations ? ” So she was   Alice
3          think it so _very_ much out of the way to   Alice
4    started to her feet , for it flashed across her   Alice

Guidelines for window size:

  • 3 tokens — immediate collocates; tight focus on the word’s closest neighbours
  • 5 tokens (default) — a good general starting point; captures most relevant local context
  • 10–15 tokens — sentence-level context; useful for pragmatic and discourse analysis
  • 20+ tokens — paragraph-level context; rarely needed and can make concordance lines unwieldy
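To see the effect of the window setting in practice, the same search can be re-run with a narrow window that keeps only the closest neighbours. A minimal sketch (mykwic_narrow is an invented object name):

Code
# Hedged sketch: a narrow window retains only the immediate collocates
mykwic_narrow <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("Alice"),
  window  = 3
) |>
  as.data.frame()

head(mykwic_narrow[, c("pre", "keyword", "post")], 3)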

Searching Multiple Patterns Simultaneously

Pass a character vector to pattern (with phrase()) to search for several expressions in one call:

Code
alice_variants <- c("poor Alice", "little Alice", "dear Alice")

kwic_variants <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase(alice_variants)
) |>
  as.data.frame()

table(kwic_variants$keyword)

little Alice   poor Alice   Poor Alice 
           3           10            1 

Note that kwic() matches patterns case-insensitively by default (case_insensitive = TRUE), which is why “Poor Alice” is returned even though the search patterns were lower-case.

Exercises: Creating Concordances

Q7. You run kwic(tokens(text), pattern = phrase("the Hatter"), case_insensitive = FALSE) and get 37 results. You then run kwic(tokens(text), pattern = phrase("the hatter"), case_insensitive = TRUE) and get 42 results. What explains the 5 additional matches in the second run?






Q8. What is the difference between kwic(tokens(text), pattern = "poor Alice") and kwic(tokens(text), pattern = phrase("poor Alice"))?






Exporting Concordances

Section Overview

What you will learn: How to export concordances to Excel, CSV, and R’s native .rds format; how to use here() for portable file paths; and how to create formatted concordance tables for reports and presentations

Exporting to Excel

Excel is the most widely compatible format for sharing concordances with colleagues who do not use R:

Code
writexl::write_xlsx(mykwic, here::here("output", "alice_concordance.xlsx"))
Using here() for File Paths

The here package constructs file paths relative to your RStudio project root, making your code portable across operating systems and user accounts:

# Hard-coded path — breaks on any other machine
write_xlsx(data, "C:/Users/Martin/Documents/project/output/file.xlsx")

# here() path — works anywhere the project folder is
write_xlsx(data, here::here("output", "file.xlsx"))

Always use here::here() in scripts you intend to share.
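One practical caveat: the export functions used in this section will typically fail if the target folder does not yet exist, so create it first. A minimal sketch, assuming an output/ folder at the project root as in the examples:

Code
# Hedged sketch: make sure the output/ folder exists before exporting
dir.create(here::here("output"), recursive = TRUE, showWarnings = FALSE)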

Other Export Formats

Code
# CSV — universal plain-text format, ideal for version control
write.csv(mykwic, here::here("output", "concordance.csv"), row.names = FALSE)

# Tab-separated — handles commas in text better than CSV
write.table(mykwic, here::here("output", "concordance.tsv"),
            sep = "\t", row.names = FALSE)

# R native format — preserves all R object attributes, smallest file size
saveRDS(mykwic, here::here("output", "concordance.rds"))

# Reload later
mykwic_reloaded <- readRDS(here::here("output", "concordance.rds"))
Format Best for
.xlsx Sharing with non-R users; easy manual annotation
.csv Plain-text exchange; version control (Git-friendly)
.tsv Large texts with commas in content
.rds R-to-R sharing; preserves object class and attributes

Formatted Tables for Reports

For presentations and papers, use flextable to create publication-ready concordance displays:

Code
mykwic |>
  head(10) |>
  dplyr::select(pre, keyword, post) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .95, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(caption = "Concordance of 'Alice' — first 10 instances.") |>
  flextable::border_outer()

pre                            keyword   post
Down the Rabbit-Hole           Alice     was beginning to get very
a book , ” thought             Alice     “ without pictures or conversations
in that ; nor did              Alice     think it so _very_ much
and then hurried on ,          Alice     started to her feet ,
In another moment down went    Alice     after it , never once
down , so suddenly that        Alice     had not a moment to
“ Well ! ” thought             Alice     to herself , “ after
for , you see ,                Alice     had learnt several things of
got to ? ” (                   Alice     had no idea what Latitude
else to do , so                Alice     soon began talking again .


Regular Expressions for Pattern Matching

Section Overview

What you will learn: What regular expressions are and why they are essential for flexible concordancing; the three main categories of regex operator (frequency quantifiers, character classes, position anchors); how to use valuetype = "regex" in kwic(); and how to apply regex to find morphological word families, words of specific structures, and complex patterns

What Are Regular Expressions?

A regular expression (regex) is a sequence of characters that describes a search pattern. Where a literal search finds only the exact string specified, a regex can match an entire family of strings defined by a structural rule. For concordancing, this means that instead of running separate searches for walk, walks, walked, and walking, a single regex \\bwalk\\w* finds all of them at once.

Regular expressions operate through three main types of operator:

Frequency Quantifiers

These control how many times a unit must appear:

Symbol   Meaning                                          Example
?        Preceding item is optional (0 or 1 times)        colou?r → colour, color
*        Preceding item appears 0 or more times           walk\w* → walk, walks, walked, walking
+        Preceding item appears 1 or more times           walk\w+ → walks, walked, walking (not bare walk)
{n}      Preceding item appears exactly n times           \w{5} → any 5-letter word
{n,}     Preceding item appears at least n times          \w{5,} → words of 5 or more letters
{n,m}    Preceding item appears between n and m times     \w{4,6} → words of 4, 5 or 6 letters

Character Classes

These represent sets of characters:

Symbol        Meaning
[ab]          Literal a or b
[A-Z]         Any uppercase letter A through Z
[0-9]         Any digit 0 through 9
[[:digit:]]   Any digit (equivalent to [0-9])
[[:lower:]]   Any lowercase letter
[[:upper:]]   Any uppercase letter
[[:alpha:]]   Any letter (upper or lower)
[[:alnum:]]   Any letter or digit
[[:punct:]]   Any punctuation character
.             Any single character (wildcard)

Note that in R the POSIX classes must appear inside a bracket expression, hence the double brackets (e.g. [[:digit:]]).

Position Anchors

These constrain where in a string the pattern must match:

Symbol   Meaning                                                              Example
\\b      Word boundary (between a word character and a non-word character)   \\brun\\b matches run but not running or rerun
\\B      Non-boundary (inside a word, between two word characters)           \\Brun\\B matches the run in rerunning but not standalone run
^        Start of the string                                                  ^Alice matches Alice only at the very start of the text
$        End of the string                                                    Alice$ matches Alice only at the very end of the text
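Before using these operators inside kwic(), it can help to try them with stringr on a few toy strings. A minimal sketch (the example vector is invented for illustration):

Code
# Hedged sketch: regex operators applied to toy strings
words <- c("run", "running", "rerun", "colour", "color", "walk", "walked")

stringr::str_detect(words, "colou?r")      # optional "u": TRUE for colour and color
stringr::str_detect(words, "\\bwalk\\w*")  # stem plus any continuation: walk, walked
stringr::str_detect(words, "\\brun\\b")    # whole word only: run, not running or rerun
stringr::str_detect(words, "^re")          # anchored at the string start: rerun only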

Using Regex in kwic()

Set valuetype = "regex" to activate regular expression matching:

Code
# Find all words beginning with "alic" OR "hatt"
kwic_regex <- quanteda::kwic(
  quanteda::tokens(text),
  pattern   = "\\b(alic|hatt)\\w*",
  valuetype = "regex"
) |>
  as.data.frame()

table(kwic_regex$keyword)

   Alice  Alice’s   hatter   Hatter Hatter’s  hatters 
     386       10        1       54        1        1 

Pattern breakdown:

  • \\b — word boundary: the match must begin at the start of a word
  • (alic|hatt) — alternation: either “alic” or “hatt”
  • \\w* — zero or more word characters: captures any continuation of the stem

The first rows of the matches:

docname  from   to                           pre  keyword                                 post  pattern
text1       4    4         Down the Rabbit-Hole   Alice    was beginning to get very             \b(alic|hatt)\w*
text1      63   63           a book , ” thought   Alice    “ without pictures or conversations   \b(alic|hatt)\w*
text1     143  143            in that ; nor did   Alice    think it so _very_ much               \b(alic|hatt)\w*
text1     229  229        and then hurried on ,   Alice    started to her feet ,                 \b(alic|hatt)\w*
text1     299  299   In another moment down went  Alice    after it , never once                 \b(alic|hatt)\w*
text1     338  338      down , so suddenly that   Alice    had not a moment to                   \b(alic|hatt)\w*
text1     521  521           “ Well ! ” thought   Alice    to herself , “ after                  \b(alic|hatt)\w*
text1     647  647             for , you see ,    Alice    had learnt several things of          \b(alic|hatt)\w*

Common Regex Concordance Patterns

Code
# All morphological forms of "think"
kwic(tokens(text), pattern = "\\bthink\\w*", valuetype = "regex")
# Matches: think, thinks, thinking, thinker (thought needs its own alternative, e.g. "\\b(think|thought)\\w*")

# Words ending in "-tion"
kwic(tokens(text), pattern = "\\w+tion\\b", valuetype = "regex")

# Words ending in "-ing"
kwic(tokens(text), pattern = "\\w+ing\\b", valuetype = "regex")

# Exactly four-letter words
kwic(tokens(text), pattern = "\\b\\w{4}\\b", valuetype = "regex")

# Words of ten or more letters
kwic(tokens(text), pattern = "\\b\\w{10,}\\b", valuetype = "regex")

# Words beginning with un- (negative prefix)
kwic(tokens(text), pattern = "\\bun\\w+", valuetype = "regex")

# Words beginning with a vowel
kwic(tokens(text), pattern = "\\b[aeiou]\\w*", valuetype = "regex")
Test Regex Patterns Before Running

Regular expressions can be deceptively fragile. Small errors produce patterns that match either too much (returning thousands of false positives) or too little (missing the intended targets silently). Best practice is to test your pattern on a small sample using stringr::str_detect() or an online regex tester (e.g., regex101.com) before running it on a full corpus. Always inspect a random sample of the results to verify the pattern is working as intended.
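For example, a pattern intended to capture the think word family can be checked against a handful of hand-picked strings before it is used in kwic(). A minimal sketch (the test vector is invented):

Code
# Hedged sketch: sanity-check a pattern on hand-picked strings first
test <- c("think", "thinking", "thinker", "thought", "rethink", "thin")
stringr::str_detect(test, "\\bthink\\w*")
# TRUE for think, thinking, thinker; FALSE for thought, rethink, thin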

Exercises: Regular Expressions

Q9. A researcher wants to find all words in the Alice text that begin with the prefix re- (such as return, repeat, remember). She writes the pattern "re\\w*" with valuetype = "regex". What is the problem with this pattern, and how should it be fixed?






Q10. You want to find all words in the text that end in either -ful or -less (e.g. careful, careless, wonderful, hopeless). Write the regex pattern you would use.






Filtering and Sorting Concordances

Section Overview

What you will learn: How to use dplyr pipelines to filter concordance output by context patterns; how to sort concordances alphabetically and by collocate frequency; how to extract the immediate left and right neighbours of a keyword; and why sorting and filtering are essential for moving from raw concordance output to linguistic insight

Why Filtering and Sorting Matter

A raw concordance of a high-frequency word in a full-length novel may contain hundreds or thousands of lines. The analytical work of concordancing only begins once the raw output has been filtered to relevant instances and sorted in ways that group similar contexts together. Two operations are central to this:

Filtering restricts the concordance to lines meeting a specified context condition — for example, instances of said preceded by a character name, or instances of very followed by an adjective. Filtering uses dplyr::filter() in combination with stringr::str_detect().

Sorting reorders the concordance lines to surface patterns. Alphabetical sorting by the word immediately following the keyword groups lines that share the same collocate. Frequency sorting prioritises the most common collocates, immediately revealing the strongest patterns in the data.

Filtering by Context Pattern

The following pipeline finds instances of alice only when the immediately preceding word is poor or little:

Code
kwic_filtered <- quanteda::kwic(
  x       = quanteda::tokens(text),
  pattern = "alice",
  case_insensitive = TRUE
) |>
  as.data.frame() |>
  # Keep only lines where the last word of the left context is "poor" or "little"
  dplyr::filter(stringr::str_detect(pre, "(poor|little)$"))

nrow(kwic_filtered)
[1] 13

The matching lines (first 10 shown):

docname   from     to                            pre  keyword                      post  pattern
text1     1542   1542      through , ” thought poor   Alice    , “ it would be            alice
text1     1725   1725        ” but the wise little    Alice    was not going to do        alice
text1     2131   2131          but , alas for poor    Alice    ! when she got to          alice
text1     2333   2333         now , ” thought poor    Alice    , “ to pretend to          alice
text1     3605   3605         words , ” said poor     Alice    , and her eyes filled      alice
text1     6877   6877         it ! ” pleaded poor     Alice    . “ But you’re so          alice
text1     7291   7291           ! ” And here poor     Alice    began to cry again ,       alice
text1     8240   8240        home , ” thought poor    Alice    , “ when one wasn’t        alice
text1    11789  11789         it ! ” pleaded poor     Alice    in a piteous tone .        alice
text1    19142  19142  This answer so confused poor   Alice    , that she let the         alice
Pattern anatomy: str_detect(pre, "(poor|little)$") checks whether the pre column (the left context string) ends with ($) either poor or little. The $ anchor is important: without it, the filter would also match lines where poor or little appears earlier in the left context but is not the immediately preceding word.
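If you want to be certain that only the whole words poor and little are matched (so that a word merely ending in those letters could not slip through), a word boundary can be added. A minimal sketch (kwic_filtered_strict is an invented object name):

Code
# Hedged sketch: stricter variant of the filter with an explicit word boundary
kwic_filtered_strict <- quanteda::kwic(
  x       = quanteda::tokens(text),
  pattern = "alice",
  case_insensitive = TRUE
) |>
  as.data.frame() |>
  dplyr::filter(stringr::str_detect(pre, "\\b(poor|little)$"))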

Multiple Conditions

Conditions can be combined with & (AND) or | (OR):

Code
# Find "said" preceded by a character name AND followed by a comma or period
kwic_said <- quanteda::kwic(
  x       = quanteda::tokens(text),
  pattern = "said"
) |>
  as.data.frame() |>
  dplyr::filter(
    stringr::str_detect(pre,  "(Alice|Hatter|Queen|King|Cat|Mouse)$"),
    stringr::str_detect(post, "^[,.]")
  )

head(kwic_said, 8)
  docname  from    to                    pre keyword
1   text1 15850 15850  direction , ” the Cat    said
2   text1 17618 17618     yet ? ” the Hatter    said
3   text1 17747 17747   don’t ! ” the Hatter    said
4   text1 30643 30643      must , ” the King    said
5   text1 31312 31312 important , ” the King    said
6   text1 32748 32748   verdict , ” the King    said
                               post pattern
1            , waving its right paw    said
2          , turning to Alice again    said
3 , tossing his head contemptuously    said
4           , with a melancholy air    said
5             , turning to the jury    said
6         , for about the twentieth    said

This pipeline extracts speech acts of the form “[Character name] said, …” — a common construction in narrative fiction.

Alphabetical Sorting

Sorting alphabetically by the right context groups together lines with the same immediately following word, making collocational patterns immediately visible:

Code
kwic_sorted_alpha <- quanteda::kwic(
  x       = quanteda::tokens(text),
  pattern = "alice",
  case_insensitive = TRUE
) |>
  as.data.frame() |>
  dplyr::arrange(post)

head(kwic_sorted_alpha, 8)
  docname  from    to                     pre keyword                     post
1   text1  7754  7754       happen : “ ‘ Miss   Alice   ! Come here directly ,
2   text1  2888  2888  the garden door . Poor   Alice         ! It was as much
3   text1  2131  2131     but , alas for poor   Alice        ! when she got to
4   text1 30891 30891      voice , the name “   Alice        ! ” CHAPTER XII .
5   text1  8423  8423      “ Oh , you foolish   Alice ! ” she answered herself
6   text1  2606  2606 and curiouser ! ” cried   Alice        ( she was so much
7   text1 25861 25861      I haven’t , ” said   Alice        ) — “ and perhaps
8   text1 32275 32275     explain it , ” said   Alice        , ( she had grown
  pattern
1   alice
2   alice
3   alice
4   alice
5   alice
6   alice
7   alice
8   alice

Frequency Sorting

Frequency sorting identifies the most common collocates — the words that appear most often immediately before or after the keyword:

Code
kwic_sorted_freq <- quanteda::kwic(
  x       = quanteda::tokens(text),
  pattern = "alice",
  case_insensitive = TRUE
) |>
  as.data.frame() |>
  # Extract the first word of the right context
  dplyr::mutate(post1 = stringr::str_extract(post, "^\\w+")) |>
  # Count how often each right-context word occurs
  dplyr::add_count(post1, name = "post1_freq") |>
  # Sort from most to least frequent
  dplyr::arrange(dplyr::desc(post1_freq))

head(kwic_sorted_freq |> dplyr::select(pre, keyword, post, post1, post1_freq), 12)
                        pre keyword                                post post1
1        a book , ” thought   Alice “ without pictures or conversations  <NA>
2  through , ” thought poor   Alice                     , “ it would be  <NA>
3      here before , ” said   Alice                   , ) and round the  <NA>
4  curious feeling ! ” said   Alice                       ; “ I must be  <NA>
5       but , alas for poor   Alice                   ! when she got to  <NA>
6      now , ” thought poor   Alice                   , “ to pretend to  <NA>
7           eat it , ” said   Alice                       , “ and if it  <NA>
8   and curiouser ! ” cried   Alice                   ( she was so much  <NA>
9       to them , ” thought   Alice                 , “ or perhaps they  <NA>
10   the garden door . Poor   Alice                    ! It was as much  <NA>
11     of yourself , ” said   Alice                    , “ a great girl  <NA>
12      words , ” said poor   Alice               , and her eyes filled  <NA>
   post1_freq
1         163
2         163
3         163
4         163
5         163
6         163
7         163
8         163
9         163
10        163
11        163
12        163
Code
# Summary table: top 10 right-context collocates
kwic_sorted_freq |>
  dplyr::distinct(post1, post1_freq) |>
  dplyr::arrange(dplyr::desc(post1_freq)) |>
  head(10) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .4, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 11) |>
  flextable::set_caption(caption = "Top 10 words immediately following 'alice'.") |>
  flextable::border_outer()

post1     post1_freq
<NA>             163
was               17
thought           12
had               11
said              11
could             11
replied            9
did                9
looked             8
to                 7
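The <NA> row at the top corresponds to concordance lines whose right context begins with punctuation (quotation marks, commas, and so on), for which str_extract(post, "^\\w+") returns NA. A minimal sketch that drops those lines so only word collocates are counted:

Code
# Hedged sketch: exclude punctuation-initial contexts (post1 is NA for those lines)
kwic_sorted_freq |>
  dplyr::filter(!is.na(post1)) |>
  dplyr::distinct(post1, post1_freq) |>
  dplyr::arrange(dplyr::desc(post1_freq)) |>
  head(10)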

Extracting N-gram Collocates

It is often useful to extract not just the immediately adjacent word but the first two or three words on each side, building a more complete collocational profile:

Code
kwic_ngram <- quanteda::kwic(
  x       = quanteda::tokens(text),
  pattern = "alice",
  case_insensitive = TRUE,
  window  = 5
) |>
  as.data.frame() |>
  dplyr::rowwise() |>
  dplyr::mutate(
    post1 = stringr::str_split(post, " ")[[1]][1],
    post2 = stringr::str_split(post, " ")[[1]][2],
    pre1  = dplyr::last(stringr::str_split(pre, " ")[[1]]),
    pre2  = rev(stringr::str_split(pre, " ")[[1]])[2]
  ) |>
  dplyr::ungroup()

# Most common bigrams following "alice"
kwic_ngram |>
  dplyr::count(post1, post2, sort = TRUE) |>
  head(10) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .4, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 11) |>
  flextable::set_caption(caption = "Most frequent bigrams immediately following 'alice'.") |>
  flextable::border_outer()

post1   post2      n
.                 50
,                 20
;                 13
,       and        9
did     not        9
,       as         6
,       who        6
:                  6
in      a          6
to      herself    6

Exercises: Filtering and Sorting

Q11. You extract a concordance of the word thought in the Alice text and want to find only instances where the preceding context ends with she (i.e. “she thought”). Which dplyr::filter() call achieves this?






Q12. After sorting a concordance of very by frequency of the immediately following word, you find that much is the most frequent right-context collocate. What does this tell you about the grammatical behaviour of very in the text, and what follow-up analysis might you do?






Working with Spoken Transcripts

Section Overview

What you will learn: How spoken language transcripts differ structurally from written text; how to load and preprocess ICE-Ireland transcript files; how to handle annotation markup in concordancing; and how to extract and analyse discourse markers in spoken data

Characteristics of Spoken Transcripts

Spoken language transcripts differ from written texts in several ways that require adapted preprocessing:

Speaker turn structure — transcripts are organised by speaker turns, often with speaker IDs encoded in annotation tags.

Paralinguistic markers — laughter, coughing, pauses, and overlaps are typically encoded using special markup: <,> for pauses, <&> laughter </&> for paralinguistic events.

Incomplete and non-standard forms — spoken language contains false starts, filled pauses (uh, um), and incomplete utterances that do not correspond to standard written sentences.

Metadata headers — corpus transcripts typically begin with file-level metadata: recording date, speaker demographics, and topic information.

These features mean that the same preprocessing pipeline used for written texts will not work cleanly for transcripts. The approach must be adapted to the specific annotation conventions of the corpus being used.

Loading ICE-Ireland Transcripts

We work with a sample of five files from the spoken dialogue section of the International Corpus of English — Irish component:

Code
files <- paste0("tutorials/kwics/data/ICEIrelandSample/S1A-00", 1:5, ".txt")

transcripts <- sapply(files, readLines, USE.NAMES = TRUE)
Code
# Inspect the first 12 lines of the first file
transcripts[[1]][1:12]
 [1] "<S1A-001 Riding>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
 [2] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
 [3] "<I>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
 [4] "<S1A-001$A> <#> Well how did the riding go tonight"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
 [5] "<S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&>"                                                                                                                                                                                                                                                                                                                                                                                                                                    
 [6] "<S1A-001$A> <#> What did you call your horse"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
 [7] "<S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
 [8] "<S1A-001$A> <#> And how did Mabel do"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
 [9] "<S1A-001$B> <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse refused and it refused three times <#> And then <,> she got it round and she just lined it up straight and she just kicked it and she hit it with the whip <,> and over it went the last time you know <#> And Stephanie told her she was very determined and very well-ridden <&> laughter </&> because it had refused the other times you know <#> But Stephanie wouldn't let her give up on it <#> She made her keep coming back and keep coming back <,> until <,> it jumped it you know <#> It was good"
[10] "<S1A-001$A> <#> Yeah I 'm not so sure her jumping 's improving that much <#> She uh <,> seemed to be holding the reins very tight"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
[11] "<S1A-001$B> <#> Yeah she was <#> That 's what Stephanie said <#> <{> <[> She </[> needed to <,> give the horse its head"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
[12] "<S1A-001$A> <#> <[> Mm </[> </{>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

The output reveals the ICE annotation conventions:

  • <S1A-001 Riding> — file header with ID and title
  • <I> — transcript start marker
  • <S1A-001$A> — speaker A in file 001
  • <#> — speech unit boundary
  • <,> — pause
  • <&> laughter </&> — paralinguistic event

Collapsing and Cleaning Transcripts

For basic concordancing we collapse each transcript to a single string and normalise whitespace. We retain the annotation tags for now — they can be used later to extract speaker-specific data:

Code
transcripts_collapsed <- sapply(transcripts, function(x) {
  x |>
    paste0(collapse = " ") |>
    stringr::str_squish()
})

# Preview the first 400 characters of the first transcript
substr(transcripts_collapsed[[1]], 1, 400)
[1] "<S1A-001 Riding> <I> <S1A-001$A> <#> Well how did the riding go tonight <S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&> <S1A-001$A> <#> What did you call your horse <S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh <S1A-001$A> <#> And how did Mabel do <S1A-0"

Concordancing Transcripts

We search for the discourse marker you know, a high-frequency pragmatic expression in spoken Irish English used to signal common ground, manage turn-taking, and mark informational status:

Code
kwic_youknow <- quanteda::kwic(
  # Use "fasterword" tokeniser to preserve annotation tags as tokens
  quanteda::tokens(transcripts_collapsed, what = "fasterword"),
  pattern = quanteda::phrase("you know"),
  window  = 10
) |>
  as.data.frame() |>
  # Clean up document names to show only the file ID
  dplyr::mutate(docname = stringr::str_extract(docname, "S1A-\\d+"))

The first rows of the result:

docname  from   to  pre                                                         keyword   post                                                            pattern
S1A-001    42   43  let me jump <,> that was only the fourth time               you know  <#> It was great <&> laughter </&> <S1A-001$A> <#> What         you know
S1A-001   140  141  the whip <,> and over it went the last time                 you know  <#> And Stephanie told her she was very determined and          you know
S1A-001   164  165  <&> laughter </&> because it had refused the other times    you know  <#> But Stephanie wouldn't let her give up on it                you know
S1A-001   193  194  and keep coming back <,> until <,> it jumped it             you know  <#> It was good <S1A-001$A> <#> Yeah I 'm not                   you know
S1A-001   402  403  'd be far better waiting <,> for that one <,>               you know  and starting anew fresh <S1A-001$A> <#> Yeah but I mean         you know
S1A-001   443  444  the best goes top of the league <,> <{> <[>                 you know  </[> <S1A-001$A> <#> <[> So </[> </{> it 's like                you know
S1A-001   484  485  I 'm not sure now <#> We didn't discuss it                  you know  <S1A-001$A> <#> Well it sounds like more money <S1A-001$B> <#>  you know
S1A-001   598  599  on Monday and do without her lesson on Tuesday <,>          you know  <#> But I was keeping her going cos I says                      you know
S1A-001   727  728  to take it tomorrow <,> that she could take her             you know  the wee shoulder bag she has <S1A-001$A> <#> Mhm <S1A-001$B>    you know
S1A-001   808  809  <,> and <,> sort of show them around <,> uhm                you know  their timetable and <,> give them their timetable and show      you know

Why what = "fasterword"?

The default quanteda::tokens() word tokeniser splits on punctuation and symbols, so the <#>, <,>, and speaker tags would be broken into fragments (and dropped entirely if remove_punct and remove_symbols were set), losing information that may be important for the analysis. what = "fasterword" tokenises by whitespace only, preserving each tag as a single token. A wider context window (here, 10) is also appropriate for spoken data because pauses, tags, and fillers occupy tokens within the window.
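The difference is easy to see on a single annotated line. A minimal sketch (the snippet is copied from the first transcript above):

Code
# Hedged sketch: compare the default tokeniser with "fasterword" on one line
snippet <- "<S1A-001$A> <#> Well how did the riding go tonight"
quanteda::tokens(snippet)                       # default: tags broken into fragments
quanteda::tokens(snippet, what = "fasterword")  # whitespace only: tags kept whole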

Distribution Across Files

We can compare the frequency of you know across the five files:

Code
kwic_youknow |>
  dplyr::count(docname, name = "n_youknow") |>
  dplyr::arrange(dplyr::desc(n_youknow)) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .35, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 11) |>
  flextable::set_caption(caption = "Frequency of 'you know' per file.") |>
  flextable::border_outer()

Frequency of 'you know' per file.

docname   n_youknow
S1A-001          18
S1A-005          15
S1A-002          14
S1A-004          14
S1A-003          13

Filtering Out Annotation Tags

For some analyses, it is preferable to work with clean text from which all annotation markup has been removed. We can strip tags before concordancing:

Code
transcripts_clean <- sapply(transcripts_collapsed, function(x) {
  x |>
    # Remove all XML-style tags
    stringr::str_remove_all("<[^>]+>") |>
    # Remove multiple spaces left by tag removal
    stringr::str_squish()
})

substr(transcripts_clean[[1]], 1, 300)
[1] "Well how did the riding go tonight It was good so it was Just I I couldn't believe that she was going to let me jump that was only the fourth time you know It was great laughter What did you call your horse I can't remember Oh Mary 's Town oh And how did Mabel do Did you not see her whenever she was"
Code
kwic_yk_clean <- quanteda::kwic(
  quanteda::tokens(transcripts_clean),
  pattern = quanteda::phrase("you know"),
  window  = 7
) |>
  as.data.frame()

head(kwic_yk_clean, 5)
                                            docname from  to
1 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt   33  34
2 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt  115 116
3 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt  136 137
4 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt  161 162
5 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt  235 236
                                     pre  keyword
1     jump that was only the fourth time you know
2         and over it went the last time you know
3 because it had refused the other times you know
4    keep coming back until it jumped it you know
5      jumping for quite a few weeks now you know
                                    post  pattern
1     It was great laughter What did you you know
2    And Stephanie told her she was very you know
3 But Stephanie wouldn't let her give up you know
4                 It was good Yeah I ' m you know
5   any proper jumping really And so she you know
Exercises: Spoken Transcripts

Q13. You are studying hedging in spoken Irish English and want to find all instances of kind of and sort of in the ICE transcripts. Write the kwic() call that would achieve this using both the annotated and cleaned versions of the transcripts.






Q14. After running a concordance of you know on the annotated transcripts, you notice that many concordance lines contain annotation tags like <#> and <,> in the context windows. A colleague says you should always remove the tags before concordancing. Do you agree?






A Custom Concordance Function

Section Overview

What you will learn: How quanteda::kwic() works internally; how to build an improved custom concordance function with character-based (rather than token-based) context windows, structured output with named columns, and input validation; and when a custom function provides capabilities that kwic() does not

Why Build a Custom Function?

quanteda::kwic() is an excellent, well-tested concordancer for the vast majority of use cases. There are, however, situations where a custom function is useful:

  • Character-based context windowskwic() measures context in tokens. For some analyses (e.g. studies of typographic or layout features, or working with very short texts), a character-count window is more appropriate.
  • Fine-grained output control — a custom function can return exactly the columns and naming conventions your workflow requires without post-hoc renaming.
  • Educational transparency — writing the function from scratch makes the mechanics of concordancing explicit and demystifies them.
  • Integration with non-standard inputs — for data sources that do not fit naturally into quanteda’s corpus/tokens workflow, a character-level function may be simpler.

The Improved Custom Function

The function below improves on a basic implementation in several ways:

  1. Input validation — it checks that the text and pattern are non-empty and the context length is positive, issuing informative error messages if not.
  2. Vectorised output — it returns a properly typed tibble rather than a matrix-derived data frame.
  3. Named, clean columnsLeft, Node, Right, DocID, and MatchID are returned for clarity.
  4. Multiple document support — the function accepts a named vector and records the source document for each hit.
  5. Safe handling of edge positions — matches near the start or end of a text are handled without errors by clamping the window to the text boundaries.
Code
concordance <- function(texts, pattern, context = 80,
                        ignore_case = FALSE, perl = TRUE) {
  # ── Input validation ──────────────────────────────────────────────────────
  if (!is.character(texts) || length(texts) == 0)
    stop("`texts` must be a non-empty character vector.")
  if (!is.character(pattern) || length(pattern) != 1 || nchar(pattern) == 0)
    stop("`pattern` must be a single non-empty character string (regex allowed).")
  if (!is.numeric(context) || context < 1)
    stop("`context` must be a positive integer (number of characters per side).")

  context <- as.integer(context)

  # ── Ensure texts are named (for DocID column) ─────────────────────────────
  if (is.null(names(texts))) names(texts) <- paste0("doc", seq_along(texts))

  # ── Process each document ─────────────────────────────────────────────────
  results <- purrr::imap(texts, function(txt, doc_id) {

    # Find all match positions
    m <- gregexpr(pattern, txt,
                  ignore.case = ignore_case,
                  perl        = perl)[[1]]

    # No matches in this document
    if (m[1] == -1) return(NULL)

    match_starts  <- as.integer(m)
    match_lengths <- attr(m, "match.length")
    match_ends    <- match_starts + match_lengths - 1L
    txt_len       <- nchar(txt)

    purrr::pmap_dfr(
      list(match_starts, match_ends, seq_along(match_starts)),
      function(ms, me, idx) {

        # Character-based window, clamped to text boundaries
        left_start  <- max(1L,        ms - context)
        right_end   <- min(txt_len,   me + context)

        left  <- substr(txt, left_start, ms - 1L)
        node  <- substr(txt, ms,         me)
        right <- substr(txt, me + 1L,    right_end)

        # Trim to nearest word boundary for cleaner display
        left  <- stringr::str_remove(left,  "^\\S*\\s")   # remove partial first word
        right <- stringr::str_remove(right, "\\s\\S*$")   # remove partial last word

        tibble::tibble(
          DocID   = doc_id,
          MatchID = idx,
          Left    = left,
          Node    = node,
          Right   = right
        )
      }
    )
  })

  # ── Combine and return ────────────────────────────────────────────────────
  out <- dplyr::bind_rows(results)

  if (nrow(out) == 0) {
    message("No matches found for pattern: ", pattern)
    return(tibble::tibble(DocID   = character(),
                          MatchID = integer(),
                          Left    = character(),
                          Node    = character(),
                          Right   = character()))
  }

  out
}

Using the Custom Function

Code
# Search for "you know" with 60-character context windows
kwic_custom <- concordance(
  texts   = transcripts_collapsed,
  pattern = "you know",
  context = 60
)

nrow(kwic_custom)
[1] 62
Code
head(kwic_custom, 6)
# A tibble: 6 × 5
  DocID                                             MatchID Left     Node  Right
  <chr>                                               <int> <chr>    <chr> <chr>
1 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       1 "was go… you … " <#…
2 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       2 "hit it… you … " <#…
3 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       3 "<&> la… you … " <#…
4 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       4 "back a… you … " <#…
5 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       5 "I said… you … " an…
6 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       6 "<,> wh… you … " </…

DocID                                               MatchID  Left                                                         Node      Right
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         1  was going to let me jump <,> that was only the fourth time  you know  <#> It was great <&> laughter </&> <S1A-001$A> <#> What
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         2  hit it with the whip <,> and over it went the last time     you know  <#> And Stephanie told her she was very determined and
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         3  <&> laughter </&> because it had refused the other times    you know  <#> But Stephanie wouldn't let her give up on it <#> She
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         4  back and keep coming back <,> until <,> it jumped it        you know  <#> It was good <S1A-001$A> <#> Yeah I 'm not so sure her
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         5  I said she 'd be far better waiting <,> for that one <,>    you know  and starting anew fresh <S1A-001$A> <#> Yeah but I mean
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         6  <,> whoever 's the best goes top of the league <,> <{> <[>  you know  </[> <S1A-001$A> <#> <[> So </[> </{> it 's like another
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         7  I got <#> I 'm not sure now <#> We didn't discuss it        you know  <S1A-001$A> <#> Well it sounds like more money <S1A-001$B>
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt         8  go on Monday and do without her lesson on Tuesday <,>       you know  <#> But I was keeping her going cos I says oh I wouldn't

Regex Patterns and Case Insensitivity

The function accepts any R regular expression in its pattern argument:

Code
# Find "kind of" and "sort of" (hedges), case-insensitive
kwic_hedges <- concordance(
  texts       = transcripts_clean,
  pattern     = "\\b(kind|sort) of\\b",
  context     = 70,
  ignore_case = TRUE
)

nrow(kwic_hedges)
[1] 19
Code
head(kwic_hedges, 5)
# A tibble: 5 × 5
  DocID                                             MatchID Left     Node  Right
  <chr>                                               <int> <chr>    <chr> <chr>
1 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       1 "or any… sort… " sh…
2 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       2 "months… sort… " fl…
3 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       3 "them f… sort… " wa…
4 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt       4 "of the… sort… " wa…
5 tutorials/kwics/data/ICEIrelandSample/S1A-002.txt       1 "Mass l… kind… " lo…

Comparing Token-Based and Character-Based Windows

The key practical difference between kwic() and our custom function is the unit of the context window:

Code
# quanteda: 7-token window
kwic_token <- quanteda::kwic(
  quanteda::tokens(transcripts_clean[[1]]),
  pattern = quanteda::phrase("you know"),
  window  = 7
) |>
  as.data.frame() |>
  head(3) |>
  dplyr::mutate(type = "token-based (7 tokens)")

# Custom: 50-character window
kwic_char <- concordance(
  texts   = transcripts_clean[1],
  pattern = "you know",
  context = 50
) |>
  head(3) |>
  dplyr::mutate(type = "char-based (50 chars)")

# Show the difference
cat("=== Token-based (7 tokens) ===\n")
=== Token-based (7 tokens) ===
Code
kwic_token[1, c("pre", "keyword", "post")]
                                 pre  keyword
1 jump that was only the fourth time you know
                                post
1 It was great laughter What did you
Code
cat("\n=== Character-based (50 chars) ===\n")

=== Character-based (50 chars) ===
Code
kwic_char[1, c("Left", "Node", "Right")]
# A tibble: 1 × 3
  Left                                            Node     Right                
  <chr>                                           <chr>    <chr>                
1 "to let me jump that was only the fourth time " you know " It was great laugh…

With a token-based window, the number of context tokens is constant, but the amount of displayed text varies: seven short function words occupy far less space than seven long technical terms. With a character-based window, the displayed span is roughly constant in width regardless of token length, which can be preferable when the visual appearance of the concordance matters, at the cost of cutting off a varying number of words at the edges.
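
As a toy illustration (the two sentences below are invented, not from the ICE sample), the same three-token window yields very different line widths depending on word length, while the character-based custom function keeps the span roughly constant:

Code
short_words <- "we saw it and we said you know it was fine so we left"
long_words  <- "epidemiological surveillance suggested you know extraordinarily heterogeneous comorbidities"

# Token-based: both lines show exactly three tokens per side, but very different widths
quanteda::kwic(quanteda::tokens(short_words),
               pattern = quanteda::phrase("you know"), window = 3)
quanteda::kwic(quanteda::tokens(long_words),
               pattern = quanteda::phrase("you know"), window = 3)

# Character-based: both lines show roughly 30 characters per side
concordance(c(short = short_words, long = long_words),
            pattern = "you know", context = 30)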

Exercises: Custom Concordance Function

Q15. The custom concordance() function uses gregexpr() to find match positions and then calls substr() to extract context. What would happen if you removed the line that trims to the nearest word boundary, and why was that step added?






Q16. You want to use the custom concordance() function to find all instances of British English -ise spellings (e.g. recognise, organise, realise) in a corpus of newspaper editorials. Write the function call you would use.






Reproducible Workflows

Section Overview

What you will learn: How to organise a concordancing project for reproducibility; essential script documentation conventions; how to parameterise analyses for easy modification; and how to export results in formats that support open science

Project Structure

A well-organised project folder makes it easy to return to an analysis months later, share it with collaborators, or submit it alongside a manuscript:

my-concordance-project/
├── data/
│   ├── raw/              # Original texts — never edit these
│   └── processed/        # Cleaned texts and intermediate objects
├── scripts/
│   ├── 01-load-clean.R   # Data loading and preprocessing
│   ├── 02-concordance.R  # KWIC extraction
│   └── 03-analysis.R     # Filtering, sorting, statistics
├── output/
│   ├── concordances/     # Saved KWIC tables
│   └── figures/          # Plots and visualisations
├── README.md             # Project description
└── project.Rproj         # RStudio project file
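
Within such a layout, the here package (also used in the script below) resolves paths relative to the project root, so the scripts in scripts/ run identically from the console, from RStudio, or at knit time. A minimal sketch, with illustrative file names:

Code
library(here)

# here::here() locates the project root (e.g. via project.Rproj) and builds the
# path from there, regardless of the current working directory
raw_file  <- here::here("data", "raw", "alice.txt")
kwic_file <- here::here("output", "concordances", "kwic.rds")

# readRDS(kwic_file)  # e.g. reload a saved concordance in 03-analysis.R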

Script Documentation

A reproducible script begins with a header block and uses parameters rather than hard-coded values:

Code
# ============================================================
# Title:   Concordance analysis of hedging in Alice in Wonderland
# Author:  Martin Schweinberger
# Date:    2026-05-01
# Purpose: Extract and analyse instances of epistemic hedges
# ============================================================

library(quanteda)
library(dplyr)
library(writexl)
library(here)

# ── Parameters (change these, not the code below) ───────────
CONTEXT_WINDOW <- 7          # tokens per side
MIN_FREQ       <- 3          # minimum collocate frequency to report
OUTPUT_DIR     <- here::here("output", "concordances")

# ── Load and clean text ──────────────────────────────────────
load("tutorials/kwics/data/alice.rda")   # loads object: rawtext
text_clean <- paste0(rawtext, collapse = " ") |>
  stringr::str_squish() |>
  stringr::str_remove(".*CHAPTER I\\.")

# ── Extract concordances ─────────────────────────────────────
kwic_hedge <- quanteda::kwic(
  quanteda::tokens(text_clean),
  pattern = quanteda::phrase(c("perhaps", "might", "seemed to",
                                "appeared to", "as if")),
  window  = CONTEXT_WINDOW
) |>
  as.data.frame()

# ── Save results ──────────────────────────────────────────────
dir.create(OUTPUT_DIR, showWarnings = FALSE, recursive = TRUE)
writexl::write_xlsx(
  kwic_hedge,
  file.path(OUTPUT_DIR, paste0("hedges_kwic_", Sys.Date(), ".xlsx"))
)

# ── Session info for reproducibility ─────────────────────────
sessionInfo()

Key practices demonstrated here:

  • ParameterisationCONTEXT_WINDOW, MIN_FREQ, and OUTPUT_DIR are defined at the top so the entire analysis can be reconfigured by changing three lines.
  • Datestamped outputSys.Date() in the filename means each run creates a new output file, preserving the history of the analysis.
  • sessionInfo() at the end — records the exact R and package versions used, essential for replication.

Common Mistakes and How to Avoid Them

Top Five Concordancing Pitfalls

1. Getting case sensitivity wrong. A search such as kwic(tokens(text), phrase("alice"), case_insensitive = FALSE) returns zero results if the text only ever uses “Alice”. Check the capitalisation conventions in your data and set case_insensitive deliberately rather than assuming a default.

2. Using regex syntax without valuetype = "regex". The pattern "walk.*" without valuetype = "regex" is interpreted as a glob pattern (quanteda's default valuetype), in which . is a literal character and * is a wildcard, so it will not match walked or walking as intended. The result may be zero matches or unexpected matches.

3. Skipping preprocessing. Running kwic() on uncleaned text means metadata, headers, and formatting artefacts contaminate the concordance.

4. Not inspecting results. Always examine a random sample of concordance lines to verify the pattern matched what you intended. Use sample_n(20) to spot-check.

5. Confusing phrase() and bare strings. Always use phrase() for multi-word search targets. Without it, quanteda matches the string against individual tokens, so a two-word target will typically return nothing. Corrected calls for pitfalls 1, 2, and 5 are sketched below.
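
The sketch below shows corrected calls for pitfalls 1, 2, and 5. It assumes the text_clean object created in the reproducible script above and is an illustration rather than part of the worked analysis:

Code
toks <- quanteda::tokens(text_clean)

# 1. Make the case behaviour explicit rather than relying on the default
quanteda::kwic(toks, pattern = "alice", case_insensitive = TRUE)

# 2. Declare the pattern as a regular expression
quanteda::kwic(toks, pattern = "walk.*", valuetype = "regex")

# 5. Wrap multi-word targets in phrase()
quanteda::kwic(toks, pattern = quanteda::phrase("as if"))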


Summary and Further Reading

This tutorial has provided a comprehensive introduction to concordancing with R, covering the conceptual foundations, the tool landscape, and the full practical workflow from loading text to exporting results.

Section 1 established what concordancing is — the systematic KWIC-display extraction of words in context — and why it is central to corpus linguistics: it grounds linguistic claims in observable, verifiable evidence, overcomes the limitations of native-speaker intuition and working memory, and enables both quantitative and qualitative analysis within a single framework.

Section 2 surveyed the concordancing tool landscape: desktop software (AntConc, WordSmith, SketchEngine), web corpus interfaces (COCA, COHA, Lextutor), and R with quanteda. The case for R rests on reproducibility, integration with statistics and visualisation, flexibility, and scalability.

Sections 3 and 4 covered data loading and preprocessing, with emphasis on the importance of cleaning (collapsing lines, normalising whitespace, removing headers) before concordancing.

Section 5 introduced quanteda::kwic() in depth: the structure of the KWIC output table, window size control, phrase search with phrase(), case-insensitive search, and multi-pattern search.

Section 6 introduced regular expressions as a tool for flexible concordancing: frequency quantifiers, character classes, position anchors, and common concordancing patterns including morphological word families, suffix-based searches, and word length filters.

Section 7 extended the workflow to filtering and sorting: using dplyr pipelines with str_detect() to restrict concordances to specific context conditions, and sorting by alphabetical and frequency criteria to surface collocational patterns.

Section 8 addressed spoken language transcripts, covering the structural differences between written and transcribed spoken text, how to handle annotation markup, and the discourse marker you know as a worked example.

Section 9 presented an improved custom concordance function using character-based context windows, input validation, and support for named multi-document input — extending kwic() for use cases that token-based windows cannot serve.

Section 10 demonstrated reproducible workflow conventions: project folder structure, parameterised scripts, datestamped output, and sessionInfo() documentation.

Further reading: Sinclair (1991) remains the foundational reference for concordance-based language description. McEnery and Hardie (2011) provides a comprehensive methodological overview of corpus linguistics. Stubbs (1996) is essential on collocation and semantic prosody. Anthony (2013) discusses tool choice for corpus work. Brezina (2018) covers the statistical analysis of concordance-derived collocations. For quanteda specifically, the package documentation and tutorials at tutorials.quanteda.io are the primary resource.


Citation & Session Info

Schweinberger, Martin. 2026. Finding Words in Text: Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/kwics/kwics.html (Version 2026.05.01).

@manual{schweinberger2026kwics,
  author       = {Schweinberger, Martin},
  title        = {Finding Words in Text: Concordancing with R},
  note         = {tutorials/kwics/kwics.html},
  year         = {2026},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address      = {Brisbane},
  edition      = {2026.05.01}
}
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially expand and restructure a shorter existing LADAL draft tutorial on concordancing. All content — including all R code — was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] ggplot2_4.0.2    checkdown_0.0.13 tidyr_1.3.2      flextable_0.9.11
[5] here_1.0.2       writexl_1.5.1    stringr_1.5.1    dplyr_1.2.0     
[9] quanteda_4.2.0  

loaded via a namespace (and not attached):
 [1] fastmatch_1.1-6         gtable_0.3.6            xfun_0.56              
 [4] htmlwidgets_1.6.4       lattice_0.22-6          vctrs_0.7.1            
 [7] tools_4.4.2             generics_0.1.3          tibble_3.2.1           
[10] pkgconfig_2.0.3         Matrix_1.7-2            data.table_1.17.0      
[13] RColorBrewer_1.1-3      S7_0.2.1                uuid_1.2-1             
[16] lifecycle_1.0.5         compiler_4.4.2          farver_2.1.2           
[19] textshaping_1.0.0       codetools_0.2-20        litedown_0.9           
[22] fontquiver_0.2.1        fontLiberation_0.1.0    htmltools_0.5.9        
[25] yaml_2.3.10             pillar_1.10.1           openssl_2.3.2          
[28] fontBitstreamVera_0.1.1 commonmark_2.0.0        stopwords_2.3          
[31] tidyselect_1.2.1        zip_2.3.2               digest_0.6.39          
[34] stringi_1.8.4           purrr_1.0.4             rprojroot_2.1.1        
[37] fastmap_1.2.0           grid_4.4.2              cli_3.6.4              
[40] magrittr_2.0.3          patchwork_1.3.0         utf8_1.2.4             
[43] withr_3.0.2             gdtools_0.5.0           scales_1.4.0           
[46] rmarkdown_2.30          officer_0.7.3           askpass_1.2.1          
[49] ragg_1.3.3              evaluate_1.0.3          knitr_1.51             
[52] markdown_2.0            rlang_1.1.7             Rcpp_1.1.1             
[55] glue_1.8.0              xml2_1.3.6              renv_1.1.7             
[58] rstudioapi_0.17.1       jsonlite_1.9.0          R6_2.6.1               
[61] systemfonts_1.3.1      



References

Anthony, Laurence. 2013. “A Critical Look at Software Tools in Corpus Linguistics.” Linguistic Research 30 (2).
Brezina, Vaclav. 2018. Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press.
McEnery, Tony, and Andrew Hardie. 2011. Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press.
Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Stubbs, Michael. 1996. Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture. Oxford: Blackwell.