Finding Words in Text: Concordancing with R
The Alice text must be downloaded once and saved before knitting. Run this code once in your console (not in the tutorial itself):
rawtext <- readLines("https://www.gutenberg.org/files/11/11-0.txt")
dir.create("tutorials/kwics/data", recursive = TRUE, showWarnings = FALSE)
save(rawtext, file = "tutorials/kwics/data/alice.rda")
This creates tutorials/kwics/data/alice.rda, which the tutorial loads at knit time.

Introduction

This tutorial introduces concordancing — one of the most fundamental and powerful methods in corpus linguistics. Concordancing allows researchers to search systematically through large text collections, extracting every occurrence of a word or phrase together with the surrounding context. The resulting display, known as a keyword-in-context (KWIC) display, makes patterns of language use visible that would be impossible to detect through ordinary reading.
The tutorial covers the core concepts of concordancing, a survey of available tools from desktop software to web interfaces to R, and a hands-on practical guide to extracting, filtering, sorting, and analysing concordances using the quanteda package. It includes a section on working with spoken language transcripts and demonstrates how to build a custom concordance function that extends quanteda’s built-in capabilities.
Before working through this tutorial, we recommend familiarity with:
- Getting Started with R — R objects, basic syntax, RStudio orientation
- Loading and Saving Data in R — reading files into R
- String Processing in R — working with text using stringr
- Regular Expressions in R — pattern matching with regex
By the end of this tutorial you will be able to:
- Explain what concordancing is, how the KWIC display works, and why it is central to corpus linguistics
- Navigate the landscape of concordancing tools and choose the right tool for different research tasks
- Load and preprocess text data for concordancing in R
- Extract keyword-in-context concordances using quanteda::kwic()
- Use regular expressions to search for morphological variants and complex patterns
- Filter and sort concordances using dplyr pipelines to reveal collocational patterns
- Work with spoken language transcripts and handle annotation markup
- Build and use a custom concordance function for character-based context windows
- Export concordances in Excel, CSV, and other formats for further analysis
Schweinberger, Martin. 2026. Finding Words in Text: Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/kwics/kwics.html (Version 2026.05.01).
An interactive Binder notebook that lets you upload your own texts and run the concordancing code without installing R is available here:
Click here to open the interactive concordancing notebook.
What Is Concordancing?
What you will learn: What a concordance is and how the KWIC display works; why concordancing is central to corpus linguistics; how concordances bridge quantitative and qualitative analysis; and a survey of application areas from linguistics to lexicography to digital humanities
The KWIC Display
A concordance is a systematic list of every occurrence of a search term in a text or corpus, presented with its surrounding context. The standard presentation format is the keyword-in-context (KWIC) display, in which the search term — the node word — appears aligned in the centre of each line, with a fixed window of context on either side:
...couldn't help thinking there      must  be more to life than being merely...
...you are my density. I             must  go. Please excuse me. I mean, my...
...the situation requires that we    must  work together to achieve our aims...
...but the Emperor has no clothes!   must  speak truth to power and be heard...
This deceptively simple layout makes linguistic patterns visible that are impossible to detect through ordinary reading. By displaying multiple instances simultaneously and aligning the node word vertically, concordances allow researchers to:
- observe how words are actually used rather than how we imagine they are used
- identify collocational patterns — words that systematically appear nearby
- distinguish different meanings or senses of a polysemous word from context
- examine grammatical constructions that a word participates in
- compare register and genre variation across different text types
From Intuition to Evidence
One of the most important contributions of concordancing to linguistics is the systematic correction of native-speaker intuition. Speakers often hold confident but inaccurate beliefs about their own language use. Concordances provide observable, verifiable evidence that can confirm, nuance, or directly contradict such intuitions.
Some well-documented intuition-corpus mismatches include:
- speakers typically believe they use very more often than really, but corpus evidence frequently shows the reverse
- formal writing is assumed to avoid contractions, yet concordances reveal they are common in specific formal genres
- particular collocations assumed to be rare prove pervasive in specific registers once a corpus is examined
This empirical grounding is what distinguishes corpus linguistics from introspection-based approaches and makes concordancing indispensable for rigorous language research.
What Concordances Reveal
Concordancing supports several distinct types of linguistic investigation:
Semantic analysis — concordances allow researchers to identify the different senses of polysemous words, observe how context disambiguates meaning, and study semantic prosody: the tendency of words to collocate preferentially with words that carry a positive or negative evaluative charge. The word cause, for instance, collocates predominantly with negative nouns (cause harm, cause problems, cause damage) even though the word itself is semantically neutral.
Collocational analysis — the study of words that habitually co-occur is one of the core contributions of corpus linguistics to lexicography and language description. Concordances make it possible to build collocational profiles for words and to use association measures (mutual information, log-likelihood) to distinguish significant co-occurrences from accidental ones.
Grammatical investigation — concordances reveal verb complementation patterns (does this verb prefer an infinitival or a gerundive complement?), preposition selection, word order preferences, and evidence for grammaticalization processes.
Discourse and framing analysis — by examining what vocabulary surrounds key terms, researchers can study how concepts are constructed and contested in public discourse, identify ideological positioning through word choice, and track the evolution of discursive strategies over time.
Historical linguistics — applied to diachronic corpora, concordancing can document language change, track the semantic bleaching or expansion of words over centuries, and identify grammaticalization paths.
The Quantitative–Qualitative Bridge
Concordancing is unusual among research methods in that it supports both quantitative and qualitative analysis within a single workflow. The quantitative dimension includes frequency counts (how often does the pattern occur?), collocational strength statistics, and distributional comparisons across sub-corpora. The qualitative dimension involves close reading of individual concordance lines, interpretation of pragmatic effects, and recognition of contextual nuance.
This combination makes concordancing particularly well suited to mixed-methods research — studies that require both the breadth of corpus evidence and the depth of interpretive analysis.
Application Areas
Concordancing serves researchers and practitioners across a wide range of disciplines:
Corpus linguistics — documenting authentic language use at scale, testing theoretical claims against corpus evidence, and building usage-based grammatical descriptions.
Sociolinguistics — comparing language use across social groups (age, gender, region, register), studying style-shifting, and investigating language and identity.
Historical linguistics — tracking semantic change over time, documenting grammaticalization, and studying obsolescence and innovation.
Language teaching (Data-Driven Learning) — rather than presenting rules abstractly, DDL approaches have learners discover patterns through guided concordance analysis, building more robust intuitions about authentic usage.
Literary and stylistic analysis — concordances support authorship attribution, the study of recurring motifs and themes, and analysis of an author’s stylistic evolution across a career.
Translation studies — parallel concordancing of source and target texts helps translators find consistent equivalents, identify translation strategies, and maintain terminology consistency across large projects.
Lexicography — modern dictionaries rely on corpus evidence obtained through concordancing to identify word senses, document collocations, find authentic example sentences, and discover new words and meanings.
Content analysis and digital humanities — tracking how concepts are discussed in media, studying framing and ideology through keyword analysis, and examining the historical evolution of key terms.
Q1. A researcher notices that the word risk appears frequently in a corpus of financial news. She searches for it in a concordance and finds it consistently appears with words like exposure, mitigation, management, and assessment. What type of linguistic phenomenon is she observing, and what does it tell her about the word?
Q2. A student claims that concordancing is just a fancy search function — no different from Ctrl+F in a word processor. What is the most important limitation of Ctrl+F that concordancing overcomes?
Concordancing Tools
What you will learn: The main categories of concordancing tool — desktop software, web-based corpus interfaces, and programming environments; the strengths and appropriate use cases of each; and why R with quanteda is the recommended environment for reproducible research
Desktop Concordancing Software
Desktop concordancing applications provide powerful functionality without requiring programming knowledge and remain the most widely used tools in teaching and exploratory research.
AntConc (laurenceanthony.net) is the most widely used free concordancing tool. It is cross-platform (Windows, Mac, Linux), requires no installation dependencies, and provides an intuitive interface ideal for teaching and exploratory analysis. Beyond concordancing, it offers collocate analysis, cluster/n-gram extraction, keyword comparison across corpora, and a dispersion plot view. Its main limitations are modest statistical capabilities, limited export options, and no integration with other analytical environments. AntConc is the recommended starting point for researchers new to concordancing.
WordSmith Tools (lexically.net) is commercial software that has long been the professional standard in corpus linguistics. It provides a comprehensive suite of interconnected tools including concordancing, keyword analysis, and dispersion plots, with sophisticated built-in statistics and professional-quality visualisations. Its main limitations are cost (though institutional licences are available) and Windows-only operation.
SketchEngine (sketchengine.eu) is a web and desktop tool with access to pre-loaded corpora in over 90 languages, automatic corpus annotation, a distinctive “Word Sketch” display showing grammatical and collocational behaviour at a glance, and collaboration features for team research. It is subscription-based but widely used in professional and institutional contexts.
ParaConc is purpose-built for parallel texts (source and translation), allowing researchers to search both sides simultaneously and identify translation strategies — essential for translation studies research.
Web-Based Corpus Interfaces
Many large, professionally designed reference corpora are accessible through web interfaces, eliminating setup overhead and providing immediate access to billions of words of annotated text.
The BYU/English-Corpora.org family (english-corpora.org) created by Mark Davies includes several industry-standard reference corpora for English:
- COCA (Corpus of Contemporary American English) — over 1 billion words (1990–present), balanced across spoken, fiction, magazine, newspaper, and academic genres, updated annually
- COHA (Corpus of Historical American English) — 400+ million words (1820s–2000s) for diachronic research
- NOW Corpus — continuously updated web corpus for tracking very recent language change
These interfaces offer sophisticated search, metadata filtering (by genre, date, etc.), frequency trend visualisations, and free academic access. Their main limitation is that researchers cannot upload their own texts.
Lextutor (lextutor.ca) provides free web-based concordancing alongside vocabulary profiling tools, well suited to classroom use and quick lookups without any installation.
Why R and quanteda?
Given the range of excellent GUI tools available, why invest in learning R for concordancing? The answer lies in four properties that become increasingly important as research grows in scale and ambition:
Reproducibility is the most compelling reason. With a GUI tool, the workflow must be documented in prose (“I opened AntConc, loaded corpus X, searched for Y…”) and is difficult for others to replicate exactly. With an R script, every step is explicit, executable, and shareable. Scripts can be submitted alongside manuscripts, satisfying open science requirements and enabling exact replication.
Integration — R allows concordancing to be embedded in a seamless analytical pipeline. Text can be scraped from the web, cleaned, concordanced, collocates can be tested for statistical significance, results can be visualised with ggplot2, and statistical models can be fit — all within a single environment, without the error-prone import/export steps required when switching between tools.
Flexibility and customisation — R allows arbitrary filtering logic, novel analysis approaches, and automation of repetitive tasks that would be tedious or impossible in GUI tools. This tutorial’s custom concordance function (Section 10) illustrates how quanteda can be extended for specific research needs.
Scalability — R handles corpora of any size efficiently, supports parallel processing, and can be deployed on cloud computing infrastructure for very large-scale projects.
R’s advantages come with a learning investment. For quick one-off explorations, classroom demonstrations with non-technical audiences, or analyses that are genuinely simple, GUI tools like AntConc or COCA are faster and more practical. Many experienced corpus linguists use both: GUI tools for rapid exploration and R for the final reproducible analysis.
Q3. A postgraduate student is conducting a diachronic study of the word awful across 200 years of American English. She has no programming experience. Which tool is most directly suited to her needs, and why?
Q4. A research team plans to publish a corpus study of hedging in academic writing. They have built their own corpus from PDFs and need to submit their complete workflow alongside the manuscript. Which approach is most appropriate?
Setup
Installing Packages
Code
# Run once — comment out after installation
install.packages(c(
  "quanteda",  # core concordancing and tokenisation
  "dplyr",     # data manipulation
  "stringr",   # string processing
  "writexl",   # Excel export
  "here",      # portable file paths
  "flextable", # formatted tables
  "tidyr",     # data reshaping
  "ggplot2",   # visualisation
  "checkdown"  # interactive exercises
))
Loading Packages
Code
library(quanteda)
library(dplyr)
library(stringr)
library(writexl)
library(here)
library(flextable)
library(tidyr)
library(ggplot2)
library(checkdown)
Always load all packages at the very top of your script, before any analysis code. This makes dependencies immediately visible to anyone reading or reusing your script and avoids the common frustration of hitting a “function not found” error halfway through a long analysis.
Loading and Preparing Text
What you will learn: How to load a pre-saved text file into R; what raw text data looks like and why it requires preprocessing; how to clean text for concordancing using stringr; and why data preparation is an essential — not optional — step in any text analysis
Loading Alice in Wonderland
We use Lewis Carroll’s Alice’s Adventures in Wonderland throughout the practical sections of this tutorial. This classic novel provides rich literary language, memorable characters and constructions, sufficient length for meaningful frequency patterns, and is freely available in the public domain via Project Gutenberg.
The text has been pre-downloaded and saved as an .rda file in the tutorial’s data/ folder so that the tutorial renders without requiring an active internet connection at knit time.
Code
load(here::here("tutorials/kwics/data/alice.rda")) # loads object: rawtext
Because the file was written with save(), it must be read back with load(), which restores the saved object (here, rawtext) into the workspace. Let us inspect the first non-empty lines to understand the structure of the raw file:
Code
rawtext[rawtext != ""] |> head(25)
 [1] "Alice’s Adventures in Wonderland"
[2] "by Lewis Carroll"
[3] "CHAPTER I."
[4] "Down the Rabbit-Hole"
[5] "Alice was beginning to get very tired of sitting by her sister on the"
[6] "bank, and of having nothing to do: once or twice she had peeped into"
[7] "the book her sister was reading, but it had no pictures or"
[8] "conversations in it, “and what is the use of a book,” thought Alice"
[9] "“without pictures or conversations?”"
[10] "So she was considering in her own mind (as well as she could, for the"
[11] "hot day made her feel very sleepy and stupid), whether the pleasure of"
[12] "making a daisy-chain would be worth the trouble of getting up and"
[13] "picking the daisies, when suddenly a White Rabbit with pink eyes ran"
[14] "close by her."
[15] "There was nothing so _very_ remarkable in that; nor did Alice think it"
[16] "so _very_ much out of the way to hear the Rabbit say to itself, “Oh"
[17] "dear! Oh dear! I shall be late!” (when she thought it over afterwards,"
[18] "it occurred to her that she ought to have wondered at this, but at the"
[19] "time it all seemed quite natural); but when the Rabbit actually _took a"
[20] "watch out of its waistcoat-pocket_, and looked at it, and then hurried"
[21] "on, Alice started to her feet, for it flashed across her mind that she"
[22] "had never before seen a rabbit with either a waistcoat-pocket, or a"
[23] "watch to take out of it, and burning with curiosity, she ran across the"
[24] "field after it, and fortunately was just in time to see it pop down a"
[25] "large rabbit-hole under the hedge."
The output reveals several features typical of Project Gutenberg files: a title page, legal boilerplate, a table of contents, and chapter headings, all preceding the actual narrative text. These must be handled before analysis.
Cleaning the Text
In practice, researchers typically spend around 80% of their time cleaning and preparing data and 20% analysing it. This is not wasted effort — it is essential investment. Contaminated or inconsistently formatted data produces unreliable concordance results. Poor preparation leads to missed matches, false matches, and results that cannot be replicated.
We apply three cleaning steps in sequence:
Code
text <- rawtext |>
  # 1. Collapse all lines into one continuous string
  paste0(collapse = " ") |>
  # 2. Normalise whitespace (multiple spaces → single space)
  stringr::str_squish() |>
  # 3. Remove Project Gutenberg header up to and including "CHAPTER I."
  stringr::str_remove(".*CHAPTER I\\.")
Code
substr(text, 1, 600)
[1] " Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?” So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran"
What each step does:
paste0(collapse = " ") collapses all the separate line elements of the vector into a single continuous string. This is necessary because kwic() works best on continuous text rather than line-by-line vectors.
stringr::str_squish() removes leading and trailing whitespace and reduces any internal sequence of whitespace characters (spaces, tabs, line breaks) to a single space. This standardises spacing throughout the document.
stringr::str_remove(".*CHAPTER I\\.") removes everything from the beginning of the string up to and including the literal text “CHAPTER I.” The .* matches any run of characters, and because regex quantifiers are greedy it consumes as much as possible, so the match extends to the last occurrence of “CHAPTER I.” — conveniently stripping the table-of-contents entry along with the header. The \\. is an escaped period (. is a special regex character meaning “any character”, so we must escape it to mean a literal full stop).
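To see why greedy matching works in our favour here, consider a toy example (a sketch using base R's sub(), which applies the same greedy regex semantics as stringr::str_remove(); the sample string is hypothetical):

```r
# Toy string in which "CHAPTER I." occurs twice: once in a mock
# table of contents and once as the actual chapter heading
x <- "Contents: CHAPTER I. Down the Rabbit-Hole CHAPTER I. Alice was beginning"

# Greedy .* consumes as much as possible, so everything up to and
# including the LAST "CHAPTER I." is removed -- the table-of-contents
# entry disappears along with the header
sub(".*CHAPTER I\\.", "", x)
# -> " Alice was beginning"
```

If the pattern were lazy (.*?), only the text up to the first occurrence would be removed, leaving the duplicate heading in the text.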
Q5. You load a Project Gutenberg text and collapse it with paste0(collapse = " "). Before running str_squish(), you notice the collapsed string contains stretches like "Chapter I.    Down the Rabbit-Hole" with runs of multiple spaces. What does str_squish() do to this, and why is this normalisation important for concordancing?
Q6. A colleague skips the str_remove(".*CHAPTER I\\.") step and runs the concordance directly on the uncleaned text. What specific problems might this cause?
Creating Concordances with quanteda
What you will learn: How quanteda::kwic() works; the structure of the KWIC output table; how to control the context window size; how to search for single words, phrases, and case-insensitive patterns; and how to interpret and count concordance results
Basic KWIC Extraction
The kwic() function is quanteda’s concordancing engine. It requires a tokens object as its first argument — raw text must be passed through quanteda::tokens() before it can be concordanced.
Code
mykwic <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("Alice")
) |>
  as.data.frame()
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
text1 | 4 | 4 | Down the Rabbit-Hole | Alice | was beginning to get very | Alice |
text1 | 63 | 63 | a book , ” thought | Alice | “ without pictures or conversations | Alice |
text1 | 143 | 143 | in that ; nor did | Alice | think it so _very_ much | Alice |
text1 | 229 | 229 | and then hurried on , | Alice | started to her feet , | Alice |
text1 | 299 | 299 | In another moment down went | Alice | after it , never once | Alice |
text1 | 338 | 338 | down , so suddenly that | Alice | had not a moment to | Alice |
text1 | 521 | 521 | “ Well ! ” thought | Alice | to herself , “ after | Alice |
text1 | 647 | 647 | for , you see , | Alice | had learnt several things of | Alice |
The Concordance Table Structure
The output is a data frame with six key columns:
| Column | Contents |
|---|---|
| docname | Source document name (useful for multi-text corpora) |
| from | Start position (token index) of the match |
| to | End position (token index) of the match |
| pre | Left context — tokens to the left of the keyword |
| keyword | The matched token(s) |
| post | Right context — tokens to the right of the keyword |
The pre and post columns contain the context window. By default, kwic() returns 5 tokens on each side.
Counting Matches
Code
nrow(mykwic)
[1] 386
There are 386 occurrences of “Alice” in the text. We can also see what exact forms were matched:
Code
table(mykwic$keyword)
Alice
386
Adjusting the Context Window
The window argument controls how many tokens appear on each side of the keyword. The default is 5; wider windows provide more context for interpretation, while narrower windows are better for studying immediate collocates.
Code
mykwic_wide <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("Alice"),
  window = 10
) |>
  as.data.frame()
head(mykwic_wide, 4)
  docname from to pre keyword
1 text1 4 4 Down the Rabbit-Hole Alice
2 text1 63 63 what is the use of a book , ” thought Alice
3 text1 143 143 was nothing so _very_ remarkable in that ; nor did Alice
4 text1 229 229 and looked at it , and then hurried on , Alice
post pattern
1 was beginning to get very tired of sitting by her Alice
2 “ without pictures or conversations ? ” So she was Alice
3 think it so _very_ much out of the way to Alice
4 started to her feet , for it flashed across her Alice
Guidelines for window size:
- 3 tokens — immediate collocates; tight focus on the word’s closest neighbours
- 5 tokens (default) — a good general starting point; captures most relevant local context
- 10–15 tokens — sentence-level context; useful for pragmatic and discourse analysis
- 20+ tokens — paragraph-level context; rarely needed and can make concordance lines unwieldy
Phrase Search
quanteda::phrase() tells kwic() to treat a multi-word expression as a single search unit. Without phrase(), each word would be searched independently.
Code
kwic_pooralice <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("poor Alice")
) |>
  as.data.frame()
nrow(kwic_pooralice)
[1] 11
Code
head(kwic_pooralice, 5)
  docname from to pre keyword post
1 text1 1541 1542 go through , ” thought poor Alice , “ it would be
2 text1 2130 2131 ; but , alas for poor Alice ! when she got to
3 text1 2332 2333 use now , ” thought poor Alice , “ to pretend to
4 text1 2887 2888 to the garden door . Poor Alice ! It was as much
5 text1 3604 3605 right words , ” said poor Alice , and her eyes filled
pattern
1 poor Alice
2 poor Alice
3 poor Alice
4 poor Alice
5 poor Alice
Case-Insensitive Search
In quanteda, kwic() matches case-insensitively by default (case_insensitive = TRUE) — which is why the “poor Alice” search above also returned “Poor Alice”. Spelling the argument out documents this choice explicitly; setting case_insensitive = FALSE restricts matches to the exact capitalisation of the pattern:
Code
kwic_ci <- quanteda::kwic(
  quanteda::tokens(text),
  pattern = quanteda::phrase("alice"),
  case_insensitive = TRUE
) |>
  as.data.frame()
# Compare case-sensitive vs case-insensitive counts
cat("Case-sensitive matches:", nrow(mykwic), "\n")
Case-sensitive matches: 386
Code
cat("Case-insensitive matches:", nrow(kwic_ci), "\n")
Case-insensitive matches: 386
The counts happen to be identical here because alice never occurs in lower case in this text; in other corpora the two searches can differ substantially.
Searching Multiple Patterns Simultaneously
Pass a character vector to pattern (with phrase()) to search for several expressions in one call:
Code
alice_variants <- c("poor Alice", "little Alice", "dear Alice")
kwic_variants <- quanteda::kwic(
quanteda::tokens(text),
pattern = quanteda::phrase(alice_variants)
) |>
as.data.frame()
table(kwic_variants$keyword)
little Alice poor Alice Poor Alice
3 10 1
Q7. You run kwic(tokens(text), pattern = phrase("the Hatter"), case_insensitive = FALSE) and get 37 results. You then run kwic(tokens(text), pattern = phrase("the hatter"), case_insensitive = TRUE) and get 42 results. What explains the 5 additional matches in the second run?
Q8. What is the difference between kwic(tokens(text), pattern = "poor Alice") and kwic(tokens(text), pattern = phrase("poor Alice"))?
Exporting Concordances
What you will learn: How to export concordances to Excel, CSV, and R’s native .rds format; how to use here() for portable file paths; and how to create formatted concordance tables for reports and presentations
Exporting to Excel
Excel is the most widely compatible format for sharing concordances with colleagues who do not use R:
Code
# make sure the output folder exists before writing to it
dir.create(here::here("output"), showWarnings = FALSE)
writexl::write_xlsx(mykwic, here::here("output", "alice_concordance.xlsx"))
here() for File Paths
The here package constructs file paths relative to your RStudio project root, making your code portable across operating systems and user accounts:
# Hard-coded path — breaks on any other machine
write_xlsx(data, "C:/Users/Martin/Documents/project/output/file.xlsx")
# here() path — works anywhere the project folder is
write_xlsx(data, here::here("output", "file.xlsx"))
Always use here::here() in scripts you intend to share.
Other Export Formats
Code
# CSV — universal plain-text format, ideal for version control
write.csv(mykwic, here::here("output", "concordance.csv"), row.names = FALSE)
# Tab-separated — handles commas in text better than CSV
write.table(mykwic, here::here("output", "concordance.tsv"),
sep = "\t", row.names = FALSE)
# R native format — preserves all R object attributes, smallest file size
saveRDS(mykwic, here::here("output", "concordance.rds"))
# Reload later
mykwic_reloaded <- readRDS(here::here("output", "concordance.rds"))
| Format | Best for |
|---|---|
| .xlsx | Sharing with non-R users; easy manual annotation |
| .csv | Plain-text exchange; version control (Git-friendly) |
| .tsv | Large texts with commas in content |
| .rds | R-to-R sharing; preserves object class and attributes |
Formatted Tables for Reports
For presentations and papers, use flextable to create publication-ready concordance displays:
Code
mykwic |>
  head(10) |>
  dplyr::select(pre, keyword, post) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .95, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(caption = "Concordance of 'Alice' — first 10 instances.") |>
  flextable::border_outer()
pre | keyword | post |
|---|---|---|
Down the Rabbit-Hole | Alice | was beginning to get very |
a book , ” thought | Alice | “ without pictures or conversations |
in that ; nor did | Alice | think it so _very_ much |
and then hurried on , | Alice | started to her feet , |
In another moment down went | Alice | after it , never once |
down , so suddenly that | Alice | had not a moment to |
“ Well ! ” thought | Alice | to herself , “ after |
for , you see , | Alice | had learnt several things of |
got to ? ” ( | Alice | had no idea what Latitude |
else to do , so | Alice | soon began talking again . |
Regular Expressions for Pattern Matching
What you will learn: What regular expressions are and why they are essential for flexible concordancing; the three main categories of regex operator (frequency quantifiers, character classes, position anchors); how to use valuetype = "regex" in kwic(); and how to apply regex to find morphological word families, words of specific structures, and complex patterns
What Are Regular Expressions?
A regular expression (regex) is a sequence of characters that describes a search pattern. Where a literal search finds only the exact string specified, a regex can match an entire family of strings defined by a structural rule. For concordancing, this means that instead of running separate searches for walk, walks, walked, and walking, a single regex \\bwalk\\w* finds all of them at once.
Regular expressions operate through three main types of operator:
Frequency Quantifiers
These control how many times a unit must appear:
Symbol | Meaning | Example |
|---|---|---|
? | Preceding item is optional (0 or 1 times) | colou?r → colour, color |
* | Preceding item appears 0 or more times | walk\w* → walk, walks, walked, walking |
+ | Preceding item appears 1 or more times | walk\w+ → walks, walked, walking (not bare walk) |
{n} | Preceding item appears exactly n times | \w{5} → any 5-letter word |
{n,} | Preceding item appears at least n times | \w{5,} → words of 5 or more letters |
{n,m} | Preceding item appears between n and m times | \w{4,6} → words of 4, 5 or 6 letters |
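The quantifiers can be tried out directly with base R's grepl(), which reports whether a pattern matches each element of a character vector (a quick sketch with made-up test words):

```r
# ? makes the preceding item optional (0 or 1 occurrences):
# matches both spellings, but not an unrelated word
grepl("colou?r", c("color", "colour", "colander"))
# -> TRUE TRUE FALSE

# + requires at least one word character after the stem
grepl("walk\\w+", c("walk", "walking"))
# -> FALSE TRUE

# {5} anchored with ^ and $ matches exactly five word characters
grepl("^\\w{5}$", c("walks", "walk"))
# -> TRUE FALSE
```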
Character Classes
These represent sets of characters:
Symbol | Meaning |
|---|---|
[ab] | Literal a or b |
[A-Z] | Any uppercase letter A through Z |
[0-9] | Any digit 0 through 9 |
[[:digit:]] | Any digit (equivalent to [0-9]) |
[[:lower:]] | Any lowercase letter |
[[:upper:]] | Any uppercase letter |
[[:alpha:]] | Any letter (upper or lower) |
[[:alnum:]] | Any letter or digit |
[[:punct:]] | Any punctuation character |
. | Any single character (wildcard) |
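These classes can likewise be tested with base R's grepl(). Note that in R a POSIX class is only valid inside an outer bracket pair, so it is written [[:digit:]], not [:digit:] (a sketch with made-up strings):

```r
# POSIX classes sit inside an outer bracket expression in R
grepl("[[:digit:]]", c("Chapter 1", "Chapter One"))
# -> TRUE FALSE

# [A-Z] at the start of the string (^) detects capitalised words
grepl("^[A-Z]", c("Alice", "alice"))
# -> TRUE FALSE

# . matches any single character between "r" and "n"
grepl("r.n", c("run", "ran", "rain"))
# -> TRUE TRUE FALSE
```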
Position Anchors
These constrain where in a string the pattern must match:
Symbol | Meaning | Example |
|---|---|---|
\\b | Word boundary (between a word character and a non-word character) | \\brun\\b matches run but not running or rerun |
\\B | Non-boundary (inside a word, between two word characters) | \\Brun\\B matches the run in rerunning but not standalone run |
^ | Start of the string | ^Alice matches Alice only at the very start of the text |
$ | End of the string | Alice$ matches Alice only at the very end of the text |
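The difference between the anchors is easiest to see on a small vector — a base-R sketch (perl = TRUE enables \b and \B):

```r
words <- c("run", "running", "rerun", "rerunning")
grepl("\\brun\\b", words, perl = TRUE)  # standalone word only: TRUE FALSE FALSE FALSE
grepl("\\Brun\\B", words, perl = TRUE)  # word-internal only:   FALSE FALSE FALSE TRUE
grepl("^run", words)                    # string-initial:       TRUE TRUE FALSE FALSE
```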
Using Regex in kwic()
Set valuetype = "regex" to activate regular expression matching:
Code
# Find all words beginning with "alic" OR "hatt"
kwic_regex <- quanteda::kwic(
quanteda::tokens(text),
pattern = "\\b(alic|hatt)\\w*",
valuetype = "regex"
) |>
as.data.frame()
table(kwic_regex$keyword)
Alice Alice’s hatter Hatter Hatter’s hatters
386 10 1 54 1 1
Pattern breakdown:
- \\b — word boundary: the match must begin at the start of a word
- (alic|hatt) — alternation: either “alic” or “hatt”
- \\w* — zero or more word characters: captures any continuation of the stem
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
text1 | 4 | 4 | Down the Rabbit-Hole | Alice | was beginning to get very | \b(alic|hatt)\w* |
text1 | 63 | 63 | a book , ” thought | Alice | “ without pictures or conversations | \b(alic|hatt)\w* |
text1 | 143 | 143 | in that ; nor did | Alice | think it so _very_ much | \b(alic|hatt)\w* |
text1 | 229 | 229 | and then hurried on , | Alice | started to her feet , | \b(alic|hatt)\w* |
text1 | 299 | 299 | In another moment down went | Alice | after it , never once | \b(alic|hatt)\w* |
text1 | 338 | 338 | down , so suddenly that | Alice | had not a moment to | \b(alic|hatt)\w* |
text1 | 521 | 521 | “ Well ! ” thought | Alice | to herself , “ after | \b(alic|hatt)\w* |
text1 | 647 | 647 | for , you see , | Alice | had learnt several things of | \b(alic|hatt)\w* |
Common Regex Concordance Patterns
Code
# All morphological forms of "think"
kwic(tokens(text), pattern = "\\bthink\\w*", valuetype = "regex")
# Matches: think, thinks, thinking, thinker — but not thought
# (use "\\b(think|thought)\\w*" to include the irregular past form)
# Words ending in "-tion"
kwic(tokens(text), pattern = "\\w+tion\\b", valuetype = "regex")
# Words ending in "-ing"
kwic(tokens(text), pattern = "\\w+ing\\b", valuetype = "regex")
# Exactly four-letter words
kwic(tokens(text), pattern = "\\b\\w{4}\\b", valuetype = "regex")
# Words of ten or more letters
kwic(tokens(text), pattern = "\\b\\w{10,}\\b", valuetype = "regex")
# Words beginning with un- (negative prefix)
kwic(tokens(text), pattern = "\\bun\\w+", valuetype = "regex")
# Words beginning with a vowel
kwic(tokens(text), pattern = "\\b[aeiou]\\w*", valuetype = "regex")

Regular expressions can be deceptively fragile. Small errors produce patterns that match either too much (returning thousands of false positives) or too little (silently missing the intended targets). Best practice is to test your pattern on a small sample using stringr::str_detect() or an online regex tester (e.g., regex101.com) before running it on a full corpus. Always inspect a random sample of the results to verify the pattern is working as intended.
Q9. A researcher wants to find all words in the Alice text that begin with the prefix re- (such as return, repeat, remember). She writes the pattern "re\\w*" with valuetype = "regex". What is the problem with this pattern, and how should it be fixed?
Q10. You want to find all words in the text that end in either -ful or -less (e.g. careful, careless, wonderful, hopeless). Write the regex pattern you would use.
Filtering and Sorting Concordances
What you will learn: How to use dplyr pipelines to filter concordance output by context patterns; how to sort concordances alphabetically and by collocate frequency; how to extract the immediate left and right neighbours of a keyword; and why sorting and filtering are essential for moving from raw concordance output to linguistic insight
Why Filtering and Sorting Matter
A raw concordance of a high-frequency word in a full-length novel may contain hundreds or thousands of lines. The analytical work of concordancing only begins once the raw output has been filtered to relevant instances and sorted in ways that group similar contexts together. Two operations are central to this:
Filtering restricts the concordance to lines meeting a specified context condition — for example, instances of said preceded by a character name, or instances of very followed by an adjective. Filtering uses dplyr::filter() in combination with stringr::str_detect().
Sorting reorders the concordance lines to surface patterns. Alphabetical sorting by the word immediately following the keyword groups lines that share the same collocate. Frequency sorting prioritises the most common collocates, immediately revealing the strongest patterns in the data.
Filtering by Context Pattern
The following pipeline finds instances of alice only when the immediately preceding word is poor or little:
Code
kwic_filtered <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE
) |>
as.data.frame() |>
# Keep only lines where the last word of the left context is "poor" or "little"
dplyr::filter(stringr::str_detect(pre, "(poor|little)$"))
nrow(kwic_filtered)
[1] 13
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
text1 | 1,542 | 1,542 | through , ” thought poor | Alice | , “ it would be | alice |
text1 | 1,725 | 1,725 | ” but the wise little | Alice | was not going to do | alice |
text1 | 2,131 | 2,131 | but , alas for poor | Alice | ! when she got to | alice |
text1 | 2,333 | 2,333 | now , ” thought poor | Alice | , “ to pretend to | alice |
text1 | 3,605 | 3,605 | words , ” said poor | Alice | , and her eyes filled | alice |
text1 | 6,877 | 6,877 | it ! ” pleaded poor | Alice | . “ But you’re so | alice |
text1 | 7,291 | 7,291 | ! ” And here poor | Alice | began to cry again , | alice |
text1 | 8,240 | 8,240 | home , ” thought poor | Alice | , “ when one wasn’t | alice |
text1 | 11,789 | 11,789 | it ! ” pleaded poor | Alice | in a piteous tone . | alice |
text1 | 19,142 | 19,142 | This answer so confused poor | Alice | , that she let the | alice |
Pattern anatomy: str_detect(pre, "(poor|little)$") checks whether the pre column (the left context string) ends with ($) either poor or little. The $ anchor is important: without it, the filter would also match lines where poor or little appears earlier in the left context but is not the immediately preceding word.
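The effect of the anchor can be checked on two invented left-context strings — a minimal base-R sketch (stringr::str_detect() behaves the same way):

```r
pre <- c("alas for poor", "poor thing said the")
grepl("(poor|little)$", pre)  # anchored: only the first ends in "poor" -> TRUE FALSE
grepl("(poor|little)", pre)   # unanchored: both contain "poor"        -> TRUE TRUE
```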
Multiple Conditions
Conditions can be combined with & (AND) or | (OR):
Code
# Find "said" preceded by a character name AND followed by a comma or period
kwic_said <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "said"
) |>
as.data.frame() |>
dplyr::filter(
stringr::str_detect(pre, "(Alice|Hatter|Queen|King|Cat|Mouse)$"),
stringr::str_detect(post, "^[,.]")
)
head(kwic_said, 8)
 docname from to pre keyword
1 text1 15850 15850 direction , ” the Cat said
2 text1 17618 17618 yet ? ” the Hatter said
3 text1 17747 17747 don’t ! ” the Hatter said
4 text1 30643 30643 must , ” the King said
5 text1 31312 31312 important , ” the King said
6 text1 32748 32748 verdict , ” the King said
post pattern
1 , waving its right paw said
2 , turning to Alice again said
3 , tossing his head contemptuously said
4 , with a melancholy air said
5 , turning to the jury said
6 , for about the twentieth said
This pipeline extracts speech acts of the form “[Character name] said, …” — a common construction in narrative fiction.
Alphabetical Sorting
Sorting alphabetically by the right context groups together lines with the same immediately following word, making collocational patterns immediately visible:
Code
kwic_sorted_alpha <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE
) |>
as.data.frame() |>
dplyr::arrange(post)
head(kwic_sorted_alpha, 8)
 docname from to pre keyword post
1 text1 7754 7754 happen : “ ‘ Miss Alice ! Come here directly ,
2 text1 2888 2888 the garden door . Poor Alice ! It was as much
3 text1 2131 2131 but , alas for poor Alice ! when she got to
4 text1 30891 30891 voice , the name “ Alice ! ” CHAPTER XII .
5 text1 8423 8423 “ Oh , you foolish Alice ! ” she answered herself
6 text1 2606 2606 and curiouser ! ” cried Alice ( she was so much
7 text1 25861 25861 I haven’t , ” said Alice ) — “ and perhaps
8 text1 32275 32275 explain it , ” said Alice , ( she had grown
pattern
1 alice
2 alice
3 alice
4 alice
5 alice
6 alice
7 alice
8 alice
Frequency Sorting
Frequency sorting identifies the most common collocates — the words that appear most often immediately before or after the keyword. Note that str_extract(post, "^\\w+") returns NA when the keyword is followed by punctuation rather than a word, so the NA group can dominate the count:
Code
kwic_sorted_freq <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE
) |>
as.data.frame() |>
# Extract the first word of the right context
dplyr::mutate(post1 = stringr::str_extract(post, "^\\w+")) |>
# Count how often each right-context word occurs
dplyr::add_count(post1, name = "post1_freq") |>
# Sort from most to least frequent
dplyr::arrange(dplyr::desc(post1_freq))
head(kwic_sorted_freq |> dplyr::select(pre, keyword, post, post1, post1_freq), 12)
 pre keyword post post1
1 a book , ” thought Alice “ without pictures or conversations <NA>
2 through , ” thought poor Alice , “ it would be <NA>
3 here before , ” said Alice , ) and round the <NA>
4 curious feeling ! ” said Alice ; “ I must be <NA>
5 but , alas for poor Alice ! when she got to <NA>
6 now , ” thought poor Alice , “ to pretend to <NA>
7 eat it , ” said Alice , “ and if it <NA>
8 and curiouser ! ” cried Alice ( she was so much <NA>
9 to them , ” thought Alice , “ or perhaps they <NA>
10 the garden door . Poor Alice ! It was as much <NA>
11 of yourself , ” said Alice , “ a great girl <NA>
12 words , ” said poor Alice , and her eyes filled <NA>
post1_freq
1 163
2 163
3 163
4 163
5 163
6 163
7 163
8 163
9 163
10 163
11 163
12 163
Code
# Summary table: top 10 right-context collocates
kwic_sorted_freq |>
dplyr::distinct(post1, post1_freq) |>
dplyr::arrange(dplyr::desc(post1_freq)) |>
head(10) |>
flextable::flextable() |>
flextable::set_table_properties(width = .4, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::set_caption(caption = "Top 10 words immediately following 'alice'.") |>
flextable::border_outer()
post1 | post1_freq |
|---|---|
NA | 163 |
was | 17 |
thought | 12 |
had | 11 |
said | 11 |
could | 11 |
replied | 9 |
did | 9 |
looked | 8 |
to | 7 |
Extracting N-gram Collocates
It is often useful to extract not just the immediately adjacent word but the first two or three words on each side, building a more complete collocational profile:
Code
kwic_ngram <- quanteda::kwic(
x = quanteda::tokens(text),
pattern = "alice",
case_insensitive = TRUE,
window = 5
) |>
as.data.frame() |>
dplyr::rowwise() |>
dplyr::mutate(
post1 = stringr::str_split(post, " ")[[1]][1],
post2 = stringr::str_split(post, " ")[[1]][2],
pre1 = dplyr::last(stringr::str_split(pre, " ")[[1]]),
pre2 = rev(stringr::str_split(pre, " ")[[1]])[2]
) |>
dplyr::ungroup()
# Most common bigrams following "alice"
kwic_ngram |>
dplyr::count(post1, post2, sort = TRUE) |>
head(10) |>
flextable::flextable() |>
flextable::set_table_properties(width = .4, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::set_caption(caption = "Most frequent bigrams immediately following 'alice'.") |>
flextable::border_outer()
post1 | post2 | n |
|---|---|---|
. | “ | 50 |
, | “ | 20 |
; | “ | 13 |
, | and | 9 |
did | not | 9 |
, | as | 6 |
, | who | 6 |
: | “ | 6 |
in | a | 6 |
to | herself | 6 |
Q11. You extract a concordance of the word thought in the Alice text and want to find only instances where the preceding context ends with she (i.e. “she thought”). Which dplyr::filter() call achieves this?
Q12. After sorting a concordance of very by frequency of the immediately following word, you find that much is the most frequent right-context collocate. What does this tell you about the grammatical behaviour of very in the text, and what follow-up analysis might you do?
Working with Spoken Transcripts
What you will learn: How spoken language transcripts differ structurally from written text; how to load and preprocess ICE-Ireland transcript files; how to handle annotation markup in concordancing; and how to extract and analyse discourse markers in spoken data
Characteristics of Spoken Transcripts
Spoken language transcripts differ from written texts in several ways that require adapted preprocessing:
Speaker turn structure — transcripts are organised by speaker turns, often with speaker IDs encoded in annotation tags.
Paralinguistic markers — laughter, coughing, pauses, and overlaps are typically encoded using special markup: <,> for pauses, <&> laughter </&> for paralinguistic events.
Incomplete and non-standard forms — spoken language contains false starts, filled pauses (uh, um), and incomplete utterances that do not correspond to standard written sentences.
Metadata headers — corpus transcripts typically begin with file-level metadata: recording date, speaker demographics, and topic information.
These features mean that the same preprocessing pipeline used for written texts will not work cleanly for transcripts. The approach must be adapted to the specific annotation conventions of the corpus being used.
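If, for a given analysis, the markup is not needed at all, it can be stripped before concordancing. The sketch below assumes ICE-style tags as shown in this section; removing the paralinguistic span first prevents its content (“laughter”) from leaking into the speech text:

```r
line <- "<S1A-001$B> <#> It was good so it was <,> you know <&> laughter </&>"

# 1. Remove paralinguistic spans including their content (lazy match, PCRE)
no_para <- gsub("<&>.*?</&>", "", line, perl = TRUE)
# 2. Remove all remaining angle-bracket tags (headers, speaker IDs, pauses)
no_tags <- gsub("<[^>]+>", "", no_para)
# 3. Normalise whitespace
clean <- trimws(gsub("\\s+", " ", no_tags))
clean
# "It was good so it was you know"
```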
Loading ICE-Ireland Transcripts
We work with a sample of five files from the spoken dialogue section of the International Corpus of English — Irish component:
Code
files <- paste0("tutorials/kwics/data/ICEIrelandSample/S1A-00", 1:5, ".txt")
transcripts <- sapply(files, readLines, USE.NAMES = TRUE)
Code
# Inspect the first 12 lines of the first file
transcripts[[1]][1:12]
 [1] "<S1A-001 Riding>"
[2] ""
[3] "<I>"
[4] "<S1A-001$A> <#> Well how did the riding go tonight"
[5] "<S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&>"
[6] "<S1A-001$A> <#> What did you call your horse"
[7] "<S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh"
[8] "<S1A-001$A> <#> And how did Mabel do"
[9] "<S1A-001$B> <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse refused and it refused three times <#> And then <,> she got it round and she just lined it up straight and she just kicked it and she hit it with the whip <,> and over it went the last time you know <#> And Stephanie told her she was very determined and very well-ridden <&> laughter </&> because it had refused the other times you know <#> But Stephanie wouldn't let her give up on it <#> She made her keep coming back and keep coming back <,> until <,> it jumped it you know <#> It was good"
[10] "<S1A-001$A> <#> Yeah I 'm not so sure her jumping 's improving that much <#> She uh <,> seemed to be holding the reins very tight"
[11] "<S1A-001$B> <#> Yeah she was <#> That 's what Stephanie said <#> <{> <[> She </[> needed to <,> give the horse its head"
[12] "<S1A-001$A> <#> <[> Mm </[> </{>"
The output reveals the ICE annotation conventions:
- <S1A-001 Riding> — file header with ID and title
- <I> — transcript start marker
- <S1A-001$A> — speaker A in file 001
- <#> — speech unit boundary
- <,> — pause
- <&> laughter </&> — paralinguistic event
Collapsing and Cleaning Transcripts
For basic concordancing we collapse each transcript to a single string and normalise whitespace. We retain the annotation tags for now — they can be used later to extract speaker-specific data:
Code
transcripts_collapsed <- sapply(transcripts, function(x) {
x |>
paste0(collapse = " ") |>
stringr::str_squish()
})
# Preview the first 400 characters of the first transcript
substr(transcripts_collapsed[[1]], 1, 400)
[1] "<S1A-001 Riding> <I> <S1A-001$A> <#> Well how did the riding go tonight <S1A-001$B> <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter </&> <S1A-001$A> <#> What did you call your horse <S1A-001$B> <#> I can't remember <#> Oh Mary 's Town <,> oh <S1A-001$A> <#> And how did Mabel do <S1A-0"
Concordancing Transcripts
We search for the discourse marker you know, a high-frequency pragmatic expression in spoken Irish English used to signal common ground, manage turn-taking, and mark informational status:
Code
kwic_youknow <- quanteda::kwic(
# Use "fasterword" tokeniser to preserve annotation tags as tokens
quanteda::tokens(transcripts_collapsed, what = "fasterword"),
pattern = quanteda::phrase("you know"),
window = 10
) |>
as.data.frame() |>
# Clean up document names to show only the file ID
dplyr::mutate(docname = stringr::str_extract(docname, "S1A-\\d+"))
docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
S1A-001 | 42 | 43 | let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter </&> <S1A-001$A> <#> What | you know |
S1A-001 | 140 | 141 | the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determined and | you know |
S1A-001 | 164 | 165 | <&> laughter </&> because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it | you know |
S1A-001 | 193 | 194 | and keep coming back <,> until <,> it jumped it | you know | <#> It was good <S1A-001$A> <#> Yeah I 'm not | you know |
S1A-001 | 402 | 403 | 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> <#> Yeah but I mean | you know |
S1A-001 | 443 | 444 | the best goes top of the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So </[> </{> it 's like | you know |
S1A-001 | 484 | 485 | I 'm not sure now <#> We didn't discuss it | you know | <S1A-001$A> <#> Well it sounds like more money <S1A-001$B> <#> | you know |
S1A-001 | 598 | 599 | on Monday and do without her lesson on Tuesday <,> | you know | <#> But I was keeping her going cos I says | you know |
S1A-001 | 727 | 728 | to take it tomorrow <,> that she could take her | you know | the wee shoulder bag she has <S1A-001$A> <#> Mhm <S1A-001$B> | you know |
S1A-001 | 808 | 809 | <,> and <,> sort of show them around <,> uhm | you know | their timetable and <,> give them their timetable and show | you know |
Why what = "fasterword"?
The default quanteda::tokens() tokeniser strips punctuation and special characters. For annotated transcripts, this would remove the <#>, <,>, and speaker tags — information that may be important for the analysis. what = "fasterword" tokenises by whitespace only, preserving tags as tokens. A wider context window (here, 10) is also appropriate for spoken data because pauses, tags, and fillers occupy tokens within the window.
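The effect of whitespace-only tokenisation can be illustrated with base R: splitting on whitespace keeps every annotation tag intact as its own token, which is essentially what what = "fasterword" does inside quanteda (a minimal sketch using one transcript line from above):

```r
line <- "<S1A-001$A> <#> Well how did the riding go tonight"
strsplit(line, "\\s+")[[1]]
# "<S1A-001$A>" "<#>" "Well" "how" "did" "the" "riding" "go" "tonight"
```

A punctuation-stripping tokeniser would instead break these tags apart or discard them entirely.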
Distribution Across Files
We can compare the frequency of you know across the five files:
Code
kwic_youknow |>
dplyr::count(docname, name = "n_youknow") |>
dplyr::arrange(dplyr::desc(n_youknow)) |>
flextable::flextable() |>
flextable::set_table_properties(width = .35, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::set_caption(caption = "Frequency of 'you know' per file.") |>
flextable::border_outer()
docname | n_youknow |
|---|---|
S1A-001 | 18 |
S1A-005 | 15 |
S1A-002 | 14 |
S1A-004 | 14 |
S1A-003 | 13 |
A Custom Concordance Function
What you will learn: How quanteda::kwic() works internally; how to build an improved custom concordance function with character-based (rather than token-based) context windows, structured output with named columns, and input validation; and when a custom function provides capabilities that kwic() does not
Why Build a Custom Function?
quanteda::kwic() is an excellent, well-tested concordancer for the vast majority of use cases. There are, however, situations where a custom function is useful:
- Character-based context windows — kwic() measures context in tokens. For some analyses (e.g. studies of typographic or layout features, or working with very short texts), a character-count window is more appropriate.
- Fine-grained output control — a custom function can return exactly the columns and naming conventions your workflow requires without post-hoc renaming.
- Educational transparency — writing the function from scratch makes the mechanics of concordancing explicit and demystifies the process.
- Integration with non-standard inputs — for data sources that do not fit naturally into quanteda’s corpus/tokens workflow, a character-level function may be simpler.
The Improved Custom Function
The function below improves on a naive first implementation in several ways:
- Input validation — it checks that the text and pattern are non-empty and the context length is positive, issuing informative error messages if not.
- Vectorised output — it returns a properly typed tibble rather than a matrix-derived data frame.
- Named, clean columns — Left, Node, Right, DocID, and MatchID are returned for clarity.
- Multiple document support — the function accepts a named vector and records the source document for each hit.
- Safe handling of edge positions — matches near the start or end of a text are handled without errors by clamping the window to the text boundaries.
Code
concordance <- function(texts, pattern, context = 80,
ignore_case = FALSE, perl = TRUE) {
# ── Input validation ──────────────────────────────────────────────────────
if (!is.character(texts) || length(texts) == 0)
stop("`texts` must be a non-empty character vector.")
if (!is.character(pattern) || length(pattern) != 1 || nchar(pattern) == 0)
stop("`pattern` must be a single non-empty character string (regex allowed).")
if (!is.numeric(context) || context < 1)
stop("`context` must be a positive integer (number of characters per side).")
context <- as.integer(context)
# ── Ensure texts are named (for DocID column) ─────────────────────────────
if (is.null(names(texts))) names(texts) <- paste0("doc", seq_along(texts))
# ── Process each document ─────────────────────────────────────────────────
results <- purrr::imap(texts, function(txt, doc_id) {
# Find all match positions
m <- gregexpr(pattern, txt,
ignore.case = ignore_case,
perl = perl)[[1]]
# No matches in this document
if (m[1] == -1) return(NULL)
match_starts <- as.integer(m)
match_lengths <- attr(m, "match.length")
match_ends <- match_starts + match_lengths - 1L
txt_len <- nchar(txt)
purrr::pmap_dfr(
list(match_starts, match_ends, seq_along(match_starts)),
function(ms, me, idx) {
# Character-based window, clamped to text boundaries
left_start <- max(1L, ms - context)
right_end <- min(txt_len, me + context)
left <- substr(txt, left_start, ms - 1L)
node <- substr(txt, ms, me)
right <- substr(txt, me + 1L, right_end)
# Trim to nearest word boundary for cleaner display
left <- stringr::str_remove(left, "^\\S*\\s") # remove partial first word
right <- stringr::str_remove(right, "\\s\\S*$") # remove partial last word
tibble::tibble(
DocID = doc_id,
MatchID = idx,
Left = left,
Node = node,
Right = right
)
}
)
})
# ── Combine and return ────────────────────────────────────────────────────
out <- dplyr::bind_rows(results)
if (nrow(out) == 0) {
message("No matches found for pattern: ", pattern)
return(tibble::tibble(DocID = character(),
MatchID = integer(),
Left = character(),
Node = character(),
Right = character()))
}
out
}
Using the Custom Function
Code
# Search for "you know" with 60-character context windows
kwic_custom <- concordance(
texts = transcripts_collapsed,
pattern = "you know",
context = 60
)
nrow(kwic_custom)
[1] 62
Code
head(kwic_custom, 6)
# A tibble: 6 × 5
DocID MatchID Left Node Right
<chr> <int> <chr> <chr> <chr>
1 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 1 "was go… you … " <#…
2 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 2 "hit it… you … " <#…
3 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 3 "<&> la… you … " <#…
4 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 4 "back a… you … " <#…
5 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 5 "I said… you … " an…
6 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 6 "<,> wh… you … " </…
DocID | MatchID | Left | Node | Right |
|---|---|---|---|---|
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 1 | was going to let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter </&> <S1A-001$A> <#> What |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 2 | hit it with the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determined and |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 3 | <&> laughter </&> because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it <#> She |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 4 | back and keep coming back <,> until <,> it jumped it | you know | <#> It was good <S1A-001$A> <#> Yeah I 'm not so sure her |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 5 | I said she 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <S1A-001$A> <#> Yeah but I mean |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 6 | <,> whoever 's the best goes top of the league <,> <{> <[> | you know | </[> <S1A-001$A> <#> <[> So </[> </{> it 's like another |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 7 | I got <#> I 'm not sure now <#> We didn't discuss it | you know | <S1A-001$A> <#> Well it sounds like more money <S1A-001$B> |
tutorials/kwics/data/ICEIrelandSample/S1A-001.txt | 8 | go on Monday and do without her lesson on Tuesday <,> | you know | <#> But I was keeping her going cos I says oh I wouldn't |
Regex Patterns and Case Insensitivity
The function accepts any R regular expression in its pattern argument:
Code
# Find "kind of" and "sort of" (hedges), case-insensitive
kwic_hedges <- concordance(
texts = transcripts_collapsed,
pattern = "\\b(kind|sort) of\\b",
context = 70,
ignore_case = TRUE
)
nrow(kwic_hedges)
[1] 19
Code
head(kwic_hedges, 5)
# A tibble: 5 × 5
DocID MatchID Left Node Right
<chr> <int> <chr> <chr> <chr>
1 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 1 "or any… sort… " sh…
2 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 2 "months… sort… " fl…
3 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 3 "them f… sort… " wa…
4 tutorials/kwics/data/ICEIrelandSample/S1A-001.txt 4 "of the… sort… " wa…
5 tutorials/kwics/data/ICEIrelandSample/S1A-002.txt 1 "Mass l… kind… " lo…
Comparing Token-Based and Character-Based Windows
The key practical difference between kwic() and our custom function is the unit of the context window:
Code
# quanteda: 7-token window
kwic_token <- quanteda::kwic(
quanteda::tokens(transcripts_collapsed[[1]]),
pattern = quanteda::phrase("you know"),
window = 7
) |>
as.data.frame() |>
head(3) |>
dplyr::mutate(type = "token-based (7 tokens)")
# Custom: 50-character window
kwic_char <- concordance(
texts = transcripts_collapsed[1],
pattern = "you know",
context = 50
) |>
head(3) |>
dplyr::mutate(type = "char-based (50 chars)")
# Show the difference
cat("=== Token-based (7 tokens) ===\n")
=== Token-based (7 tokens) ===
Code
kwic_token[1, c("pre", "keyword", "post")]
 pre keyword
1 jump that was only the fourth time you know
post
1 It was great laughter What did you
Code
cat("\n=== Character-based (50 chars) ===\n")
=== Character-based (50 chars) ===
Code
kwic_char[1, c("Left", "Node", "Right")]
# A tibble: 1 × 3
Left Node Right
<chr> <chr> <chr>
1 "to let me jump that was only the fourth time " you know " It was great laugh…
With a token-based window, a text full of short function words will show more words on each side than a text with many long technical terms — because a token is a token regardless of length. With a character-based window, the amount of displayed text is constant regardless of token length, which can be preferable when the visual appearance of the concordance matters.
Q15. The custom concordance() function uses gregexpr() to find match positions and then calls substr() to extract context. What would happen if you removed the line that trims to the nearest word boundary, and why was that step added?
Q16. You want to use the custom concordance() function to find all instances of British English -ise spellings (e.g. recognise, organise, realise) in a corpus of newspaper editorials. Write the function call you would use.
Reproducible Workflows
What you will learn: How to organise a concordancing project for reproducibility; essential script documentation conventions; how to parameterise analyses for easy modification; and how to export results in formats that support open science
Project Structure
A well-organised project folder makes it easy to return to an analysis months later, share it with collaborators, or submit it alongside a manuscript:
my-concordance-project/
├── data/
│ ├── raw/ # Original texts — never edit these
│ └── processed/ # Cleaned texts and intermediate objects
├── scripts/
│ ├── 01-load-clean.R # Data loading and preprocessing
│ ├── 02-concordance.R # KWIC extraction
│ └── 03-analysis.R # Filtering, sorting, statistics
├── output/
│ ├── concordances/ # Saved KWIC tables
│ └── figures/ # Plots and visualisations
├── README.md # Project description
└── project.Rproj # RStudio project file
Script Documentation
A reproducible script begins with a header block and uses parameters rather than hard-coded values:
Code
# ============================================================
# Title: Concordance analysis of hedging in Alice in Wonderland
# Author: Martin Schweinberger
# Date: 2026-05-01
# Purpose: Extract and analyse instances of epistemic hedges
# ============================================================
library(quanteda)
library(dplyr)
library(writexl)
library(here)
# ── Parameters (change these, not the code below) ───────────
CONTEXT_WINDOW <- 7 # tokens per side
MIN_FREQ <- 3 # minimum collocate frequency to report
OUTPUT_DIR <- here::here("output", "concordances")
# ── Load and clean text ──────────────────────────────────────
load("tutorials/kwics/data/alice.rda") # loads object: rawtext
text_clean <- paste0(rawtext, collapse = " ") |>
stringr::str_squish() |>
stringr::str_remove(".*CHAPTER I\\.")
# ── Extract concordances ─────────────────────────────────────
kwic_hedge <- quanteda::kwic(
quanteda::tokens(text_clean),
pattern = quanteda::phrase(c("perhaps", "might", "seemed to",
"appeared to", "as if")),
window = CONTEXT_WINDOW
) |>
as.data.frame()
# ── Save results ──────────────────────────────────────────────
dir.create(OUTPUT_DIR, showWarnings = FALSE, recursive = TRUE)
writexl::write_xlsx(
kwic_hedge,
file.path(OUTPUT_DIR, paste0("hedges_kwic_", Sys.Date(), ".xlsx"))
)
# ── Session info for reproducibility ─────────────────────────
sessionInfo()
Key practices demonstrated here:
- Parameterisation — CONTEXT_WINDOW, MIN_FREQ, and OUTPUT_DIR are defined at the top so the entire analysis can be reconfigured by changing three lines.
- Datestamped output — Sys.Date() in the filename means each run creates a new output file, preserving the history of the analysis.
- sessionInfo() at the end — records the exact R and package versions used, essential for replication.
Common Mistakes and How to Avoid Them
1. Forgetting case sensitivity. With case_insensitive = FALSE, kwic(tokens(text), phrase("alice")) returns zero results if the text uses “Alice”. Always check capitalisation conventions in your data, and set case_insensitive = TRUE when appropriate.
2. Using regex without valuetype = "regex". The pattern "walk.*" without valuetype = "regex" is treated as a literal glob pattern, not a regular expression. The result may be zero matches or unexpected results.
3. Skipping preprocessing. Running kwic() on uncleaned text means metadata, headers, and formatting artefacts contaminate the concordance.
4. Not inspecting results. Always examine a random sample of concordance lines to verify the pattern matched what you intended. Use sample_n(20) to spot-check.
5. Confusing phrase() and bare strings. Always use phrase() for multi-word search targets. Without it, quanteda may treat the multi-word string as a glob matching individual tokens.
Summary and Further Reading
This tutorial has provided a comprehensive introduction to concordancing with R, covering the conceptual foundations, the tool landscape, and the full practical workflow from loading text to exporting results.
Section 1 established what concordancing is — the systematic KWIC-display extraction of words in context — and why it is central to corpus linguistics: it grounds linguistic claims in observable, verifiable evidence, overcomes the limitations of native-speaker intuition and working memory, and enables both quantitative and qualitative analysis within a single framework.
Section 2 surveyed the concordancing tool landscape: desktop software (AntConc, WordSmith), web corpus interfaces (Sketch Engine, COCA, COHA, Lextutor), and R with quanteda. The case for R rests on reproducibility, integration with statistics and visualisation, flexibility, and scalability.
Sections 3 and 4 covered data loading and preprocessing, with emphasis on the importance of cleaning (collapsing lines, normalising whitespace, removing headers) before concordancing.
Section 5 introduced quanteda::kwic() in depth: the structure of the KWIC output table, window size control, phrase search with phrase(), case-insensitive search, and multi-pattern search.
Section 6 introduced regular expressions as a tool for flexible concordancing: frequency quantifiers, character classes, position anchors, and common concordancing patterns including morphological word families, suffix-based searches, and word length filters.
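Those regex building blocks can be tried directly in base R, independent of any corpus; the word list below is made up for illustration:

```r
words <- c("walk", "walked", "walking", "walker",
           "happiness", "darkness", "run")

# Morphological word family: stem plus optional suffix, anchored both ends.
grep("^walk(s|ed|ing|er)?$", words, value = TRUE)
# "walk" "walked" "walking" "walker"

# Suffix-based search: position anchor at the word end.
grep("ness$", words, value = TRUE)
# "happiness" "darkness"

# Word-length filter: frequency quantifier over any character.
grep("^.{4,6}$", words, value = TRUE)
# "walk" "walked" "walker"
```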
Section 7 extended the workflow to filtering and sorting: using dplyr pipelines with str_detect() to restrict concordances to specific context conditions, and sorting by alphabetical and frequency criteria to surface collocational patterns.
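The filter-and-sort pattern looks like this on a toy concordance table; the data frame is invented, and dplyr and stringr are assumed to be installed:

```r
library(dplyr)
library(stringr)

# Hypothetical three-line concordance with pre/keyword/post columns.
conc <- data.frame(
  pre     = c("she would", "he simply", "they would"),
  keyword = c("walk", "walk", "walk"),
  post    = c("away quickly", "home slowly", "away at once")
)

# Keep only lines whose right context contains "away", then sort
# alphabetically by left context to surface recurring patterns.
conc |>
  filter(str_detect(post, "away")) |>
  arrange(pre)
```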
Section 8 addressed spoken language transcripts, covering the structural differences between written and transcribed spoken text, how to handle annotation markup, and the discourse marker you know as a worked example.
Section 9 presented an improved custom concordance function using character-based context windows, input validation, and support for named multi-document input — extending kwic() for use cases that token-based windows cannot serve.
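A character-based window, in contrast to kwic()'s token-based one, can be sketched in base R. The function name and arguments below are hypothetical, not the tutorial's actual implementation:

```r
# Toy character-window concordancer: 'context' counts characters, not tokens.
concord_chars <- function(text, target, context = 20) {
  m <- gregexpr(target, text, fixed = TRUE)[[1]]
  if (m[1] == -1) return(data.frame())        # no matches: empty result
  len <- attr(m, "match.length")
  data.frame(
    pre  = substring(text, pmax(1, m - context), m - 1),
    hit  = substring(text, m, m + len - 1),
    post = substring(text, m + len, m + len + context - 1)
  )
}

concord_chars("Alice began to feel very sleepy; Alice sat down.", "Alice", 10)
```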
Section 10 demonstrated reproducible workflow conventions: project folder structure, parameterised scripts, datestamped output, and sessionInfo() documentation.
Further reading: Sinclair (1991) remains the foundational reference for concordance-based language description. McEnery and Hardie (2011) provides a comprehensive methodological overview of corpus linguistics. Stubbs (1996) is essential on collocation and semantic prosody. Anthony (2013) discusses tool choice for corpus work. Brezina (2018) covers the statistical analysis of concordance-derived collocations. For quanteda specifically, the package documentation and tutorials at tutorials.quanteda.io are the primary resource.
Citation & Session Info
Schweinberger, Martin. 2026. Finding Words in Text: Concordancing with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/kwics/kwics.html (Version 2026.05.01).
@manual{schweinberger2026kwics,
author = {Schweinberger, Martin},
title = {Finding Words in Text: Concordancing with R},
note = {tutorials/kwics/kwics.html},
year = {2026},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2026.05.01}
}
This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially expand and restructure a shorter existing LADAL draft tutorial on concordancing. All content — including all R code — was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.
Code
sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Australia/Brisbane
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ggplot2_4.0.2 checkdown_0.0.13 tidyr_1.3.2 flextable_0.9.11
[5] here_1.0.2 writexl_1.5.1 stringr_1.5.1 dplyr_1.2.0
[9] quanteda_4.2.0
loaded via a namespace (and not attached):
[1] fastmatch_1.1-6 gtable_0.3.6 xfun_0.56
[4] htmlwidgets_1.6.4 lattice_0.22-6 vctrs_0.7.1
[7] tools_4.4.2 generics_0.1.3 tibble_3.2.1
[10] pkgconfig_2.0.3 Matrix_1.7-2 data.table_1.17.0
[13] RColorBrewer_1.1-3 S7_0.2.1 uuid_1.2-1
[16] lifecycle_1.0.5 compiler_4.4.2 farver_2.1.2
[19] textshaping_1.0.0 codetools_0.2-20 litedown_0.9
[22] fontquiver_0.2.1 fontLiberation_0.1.0 htmltools_0.5.9
[25] yaml_2.3.10 pillar_1.10.1 openssl_2.3.2
[28] fontBitstreamVera_0.1.1 commonmark_2.0.0 stopwords_2.3
[31] tidyselect_1.2.1 zip_2.3.2 digest_0.6.39
[34] stringi_1.8.4 purrr_1.0.4 rprojroot_2.1.1
[37] fastmap_1.2.0 grid_4.4.2 cli_3.6.4
[40] magrittr_2.0.3 patchwork_1.3.0 utf8_1.2.4
[43] withr_3.0.2 gdtools_0.5.0 scales_1.4.0
[46] rmarkdown_2.30 officer_0.7.3 askpass_1.2.1
[49] ragg_1.3.3 evaluate_1.0.3 knitr_1.51
[52] markdown_2.0 rlang_1.1.7 Rcpp_1.1.1
[55] glue_1.8.0 xml2_1.3.6 renv_1.1.7
[58] rstudioapi_0.17.1 jsonlite_1.9.0 R6_2.6.1
[61] systemfonts_1.3.1