Compiling a Corpus: From Texts to Analysis-Ready Data

Author

Martin Schweinberger

Introduction

This tutorial introduces the principles and practical techniques for compiling a corpus — the process of collecting, cleaning, formatting, and organising textual data for linguistic analysis. Corpus compilation is often treated as a preliminary step before the “real” analysis begins, but experienced corpus linguists know that it is where the most consequential decisions in any research project get made. As the saying goes: garbage in, garbage out. No amount of sophisticated statistical analysis can compensate for poorly designed or inadequately prepared data.

By the end of this tutorial you will have a clear, step-by-step framework for taking a corpus from an initial research idea through to a collection of clean, consistently formatted text files accompanied by a well-organised metadata spreadsheet. You will also have hands-on experience with the R tools most commonly used to automate and document this process.

Prerequisite Tutorials

Before working through this tutorial, you should be comfortable with the basics of R — installing and loading packages, reading files, and writing simple functions — since the hands-on examples all use R.

This tutorial sits in the Data Collection and Acquisition section of LADAL. After completing it, you may want to continue with Web Scraping with R or Downloading Texts from Project Gutenberg for hands-on corpus collection practice.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Explain why data collection and preparation are the foundation of corpus research and what “representativeness” means in practice
  2. Evaluate different strategies for selecting and collecting textual data — written, spoken, and existing corpora
  3. Identify and apply the five core principles of corpus data collection: purpose-driven collection, representativeness, comparability, ethical compliance, and documentation
  4. Choose an appropriate corpus size and sampling strategy for a given research question, including estimating needed corpus size based on phenomenon frequency
  5. Describe the specific compilation challenges and conventions for spoken, web/social media, learner, historical, specialised, and multilingual corpora
  6. Apply appropriate ethical frameworks (GDPR, Australian Privacy Act) to corpus data collection
  7. Convert PDFs and Word documents to plain text in R; detect and fix encoding problems
  8. Clean and format text files for corpus tools using R’s stringr and readr packages
  9. Describe the main types of corpus annotation — POS tagging, lemmatisation, dependency parsing, NER — and decide when annotation is appropriate
  10. Organise a shareable corpus using the standard folder structure: corpus root, README, LICENSE, metadata file, and data folder
  11. Write a README and choose an appropriate LICENSE for a research corpus
  12. Recognise common corpus folder structure variations for annotated, multi-genre, and diachronic corpora
  13. Design and populate a metadata spreadsheet that links text files to their contextual information
  14. Validate that metadata and corpus files are consistent using R
  15. Apply quality control procedures including inter-rater agreement, spot-sampling, and consistency checks
  16. Describe major publicly available corpora and select the most appropriate existing corpus for a given research question
  17. Recognise and avoid the seven most common pitfalls in corpus compilation
  18. Plan a corpus-based research project from research question to analysis-ready data

Citation

Schweinberger, Martin. 2026. Compiling a Corpus: From Texts to Analysis-Ready Data. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/corpuscompilation_tutorial/corpuscompilation_tutorial.html (Version 2026.05.01).


Why Data Collection Matters

Section Overview

What you will learn: Why corpus data collection is the foundation of corpus research; the “garbage in, garbage out” principle; how the research pipeline connects a question to a corpus to findings; and the core challenge of balancing ideal corpus design with practical constraints

Your corpus is your evidence

Everything you conclude in a corpus-based study rests on the foundation of your data. A well-designed corpus enables valid, reliable conclusions about language use. A poorly designed one — even if analysed with state-of-the-art methods — produces unreliable findings. This is the garbage in, garbage out principle, and it applies with particular force to corpus linguistics because the corpus is the evidence.

Researchers sometimes rush through data collection and preparation, eager to get to what they think of as the “real” analysis. But experienced corpus linguists know that data collection and preparation are the real work. They are where the critical decisions get made that shape what you can and cannot discover.

The research pipeline

Corpus research is iterative, not linear. The typical pipeline looks like this:

Research question
      │
      ▼
Data collection decisions
      │
      ▼
Data preparation (cleaning, formatting, organising)
      │
      ▼
Analysis
      │
      ▼
Interpretation ──────────────────────────────────┐
      │                                           │
      └── New questions → back to the beginning  ┘

Notice that interpretation often raises new questions that send you back to the beginning — perhaps you need different data, or need to prepare existing data differently. This iterative nature is entirely normal and should be built into your project timeline.

Balancing ideal design with real constraints

In an ideal world, corpus researchers would have unlimited time, complete access to any data they wanted, and infinite resources. In reality, every project involves constraints: limited time, restricted data access, finite budgets, and varying technical skills. The art of corpus building is making the best possible corpus within these constraints while being transparent about the limitations (Biber, Conrad, and Reppen 1998; McEnery and Hardie 2012).

Exercises: Why Data Collection Matters

Q1. A researcher collects 500 blog posts from a single popular blogging platform and uses them to make claims about “how English speakers write informally online.” What is the main methodological problem with this approach?






Principles of Data Collection

Section Overview

What you will learn: The five core principles that should guide every corpus data collection decision — purpose-driven collection, representativeness, comparability, ethical and legal compliance, and documentation

Five fundamental principles should guide your data collection decisions (Biber, Conrad, and Reppen 1998; Atkins, Clear, and Ostler 1992). They are not abstract ideals but practical guidelines that shape every decision from “where do I get my data?” to “what do I put in my metadata spreadsheet?”

1. Purpose-driven collection

You must start with clear research questions, and your data collection must align with those goals. This seems obvious, but it is easy to lose sight of when confronted with large amounts of conveniently available data. If you are studying informal spoken English, do not collect formal written texts just because they are easier to access. Your data must match your research purpose.

2. Representativeness

Your corpus should reflect the language variety you are investigating. You need to think carefully about:

  • Genre — which text types are included, and in what proportions?
  • Speaker or writer demographics — age, gender, first language, education level
  • Time period — synchronic (one time period) or diachronic (change over time)?
  • Region — British English? Australian English? International English?

Crucially, you need to acknowledge what your corpus does and does not represent (McEnery and Hardie 2012; Sinclair 1991). No corpus represents “all of English” or any entire language. If your corpus contains only written academic English from Australian universities in the 2020s, that is what you can make claims about. Do not extrapolate beyond what your data supports.

3. Comparability

If you are comparing groups — for example, first language versus second language writers, or formal versus informal registers — you must ensure comparable data collection methods. Use the same genres, similar text lengths, and equivalent contexts across the groups. If you compare L1 and L2 academic essays but the L1 essays were written under exam conditions while the L2 essays were written as take-home assignments, any differences you find might reflect those different conditions rather than genuine L1/L2 differences.

4. Ethical and legal compliance

Obtain informed consent when participants produce data for your research, respect copyright and platform terms of service, and comply with applicable privacy law such as the GDPR or the Australian Privacy Act. Anonymise personal data where required, and remember that publicly accessible text is not automatically available for research use in an ethical sense.

5. Documentation

Record all collection procedures in detail. Note your inclusion and exclusion criteria explicitly. Maintain metadata throughout the process — not as an afterthought at the end. This documentation serves two purposes: it enables other researchers to replicate your work, and it enables you to understand your own data months or years later when memory has faded.

Exercises: Principles of Data Collection

Q2. A researcher is studying differences in how L1 and L2 English speakers use hedging language in academic writing. She collects 100 L1 essays from a first-year composition course and 100 L2 essays from an English for Academic Purposes course. What comparability problem does this design have?






Types of Data Sources and Sampling

Section Overview

What you will learn: The main categories of textual data sources (written, spoken, existing corpora); practical tools and considerations for each; key decisions about corpus size, sampling strategy, and the balance between representativeness and comparability

Written sources

Published materials include books, newspapers, magazines, and academic journals. Online content — blogs, websites, forums, and social media platforms like Reddit and Twitter/X — provides massive amounts of naturally occurring text but requires careful attention to copyright, terms of service, and representativeness.

Unpublished materials such as student essays, emails, and organisational documents are often richer sources for specific research questions, but require consent from participants and may need anonymisation.

Spoken sources

Recorded speech — interviews, conversations, presentations, podcasts, lectures — offers access to naturally occurring spoken language. The key practical consideration is transcription: converting audio to text is time-consuming (typically 6–10 hours of transcription time per hour of recording) and expensive if outsourced. Factor this into your project timeline realistically.

Interview types vary in structure: structured interviews use fixed questions for all participants; semi-structured interviews have a guide but allow flexibility; unstructured interviews are more conversational. The choice affects both data richness and comparability.

Existing corpora

Before building a new corpus, always ask whether an existing corpus already addresses your research question. Publicly available corpora have significant advantages: they save time, are standardised, have consistent formatting and annotation, and have been validated through published research. The limitation is that they may not exactly match your specific research needs.

The table below lists the most widely used corpora in English and multilingual linguistics research:

Corpus | Language | Size | Content | Access
BNC (British National Corpus) | English (British) | 100M words | Mixed written + spoken, 1985–1993 | Free registration: natcorp.ox.ac.uk
BNC2014 | English (British) | 100M words | Updated spoken + written British English | Free: corpora.lancs.ac.uk/bnc2014
COCA (Corpus of Contemporary American English) | English (American) | 1B+ words | Balanced: spoken, fiction, magazine, newspaper, academic | Free limited / subscription: english-corpora.org
GloWbE (Global Web-Based English) | English (20 countries) | 1.9B words | Web text from 20 English-using countries | Free limited / subscription: english-corpora.org
ICE (International Corpus of English) | English (25 varieties) | ~1M words per variety | Spoken + written, 1990s–2000s | Varies by variety: ice-corpora.net
CHILDES | Multiple | Large | Child language acquisition | Free: childes.talkbank.org (MacWhinney 2000)
COCA-spoken | English (American) | 130M+ words | Transcribed TV and radio | Part of COCA
CLMET | English (historical) | 16M words | Diachronic written English 1710–1920 | Free: fedora.clarin.eu
MICASE | English (academic spoken) | 1.8M words | University spoken interaction | Free: lsa.umich.edu/eli/micase
Europarl | 21 EU languages | Varies | European Parliament proceedings | Free: statmt.org/europarl (Koehn 2005)
OpenSubtitles | 60+ languages | Billions | Film/TV subtitles | Free: opus.nlpl.eu
Leipzig Corpora | 300+ languages | Varies | Web-crawled text | Free: corpora.uni-leipzig.de

For Australian and Pacific linguistics specifically:

  • COOEE (Corpus of Oz Early English) — Australian English 1788–1900
  • ICE-AUS — International Corpus of English: Australian component
  • PARADISEC — Pacific and regional archive with oral language materials

Check existing corpora before building your own

A common mistake is building a new corpus when a suitable one already exists. Always search CLARIN (clarin.eu), the LDC (ldc.upenn.edu), ELRA (elra.info), and your institution’s library catalogue before investing in compilation. Even if no corpus exactly matches your needs, an existing corpus may serve as a comparison baseline or supplementary resource (Hunston 2002).

Corpus size decisions

How much data do you need? The answer depends on what you are studying:

Research focus | Typical corpus size | Rationale
Lexical studies (individual words) | 1M+ words | Rare words need large corpora to appear frequently enough
Syntactic patterns | 100K–1M words | Grammatical constructions are more frequent than rare words
Discourse analysis | 10K–100K words | Intensive analysis of fewer texts
Pilot study | 50–100 texts | Testing feasibility before full-scale collection
MA thesis | 100K–500K words | Typical scope for a supervised project
PhD dissertation | 500K–1M+ words | Larger scope required for claims of generalisability

These are guidelines, not rules. The right corpus size ultimately depends on how frequent the phenomenon you are studying is, and how much variation you need to capture.

Estimating corpus size from phenomenon frequency

A more principled approach is to estimate corpus size from the expected frequency of the phenomenon you are studying. The basic logic: if you want at least n examples of a feature for reliable analysis, and the feature occurs approximately f times per million words, you need at least n/f million words.

For example, if you want to study the discourse marker I mean (frequency approximately 200 per million words in spoken English) and want at least 500 examples, you need approximately 2.5 million words of spoken data.

A useful rule of thumb from Biber, Conrad, and Reppen (1998): for any linguistic feature occurring fewer than 10 times per million words, you need at least 10 million words to observe it reliably. Features occurring more than 100 times per million words can be studied in corpora as small as 100,000 words.

For comparative studies, this calculation applies to each subgroup separately. If you are comparing three regional varieties and need 200 examples of a construction per variety, you need sufficient data from each variety individually — not just in total.
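The n/f calculation is easy to wrap in a small helper; a minimal sketch (the function name is illustrative):

```r
# Words needed to observe n_examples of a feature that occurs
# freq_pmw times per million words
estimate_corpus_size <- function(n_examples, freq_pmw) {
  n_examples / freq_pmw * 1e6
}

# "I mean": ~200 per million words in spoken English; 500 examples wanted
estimate_corpus_size(500, 200)  # 2.5 million words
```

For a comparative study, run the calculation once per subgroup and sum the results.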

Use keyword frequency as a proxy

If you are unsure of your phenomenon’s frequency, run a pilot search in an existing corpus such as COCA or the BNC. Note how often it occurs per million words, then calculate the corpus size you need. This 5-minute check can save weeks of over- or under-collection.

Sampling strategies

Random sampling — every text has an equal chance of being selected. Good for avoiding bias but may miss important variation if the population is heterogeneous.

Stratified sampling — you divide the population into subgroups (strata) first, then sample proportionally from each. Ensures representation across categories that matter for your research (e.g. genres, time periods, demographic groups).

Purposive sampling — deliberate selection based on specific criteria relevant to your research question. Common in qualitative and specialised corpus work.

Convenience sampling — using what is accessible. Acceptable if you are transparent about its limitations and do not overclaim the generalisability of your findings.
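Stratified sampling is straightforward to implement in R. A sketch using dplyr on a hypothetical inventory of candidate texts (file names and genre labels are invented for illustration):

```r
library(dplyr)

# Hypothetical inventory: one row per candidate text
inventory <- tibble(
  file  = sprintf("text_%03d.txt", 1:300),
  genre = rep(c("news", "fiction", "academic"), each = 100)
)

# Stratified sample: 20 texts from each genre
set.seed(42)  # fix the seed and document it, so the sample is reproducible
sampled <- inventory |>
  group_by(genre) |>
  slice_sample(n = 20) |>
  ungroup()
```

For proportional (rather than equal) allocation, replace `n = 20` with `prop =` set to the sampling fraction you need.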

Balanced versus representative corpora

A balanced corpus has equal amounts from each category — excellent for comparing those categories directly. A representative corpus has proportions that match real-world distribution — better for describing language as a whole. The choice depends on your research questions. If you want to compare spoken and written English, a balanced design (equal amounts of each) lets you make fair comparisons. If you want to describe the English that people typically encounter, a representative design (reflecting that most encountered English is in fact written) is more appropriate.

Exercises: Data Sources and Sampling

Q3. A researcher wants to study how hedging language is used across different academic disciplines. She has access to journal articles from five disciplines: biology, psychology, economics, history, and linguistics. She collects 40 articles from biology (because it is easy to access) and 10 each from the other four disciplines. What sampling problem does this create?






Collecting Textual Data in Practice

Section Overview

What you will learn: Practical tools and strategies for collecting written text data — web scraping, social media APIs, manual collection, and elicitation; considerations specific to each approach; and how to use R to read a collection of text files into a usable format

Web scraping

Web scraping involves automatically extracting text from websites. Tools include Python libraries (BeautifulSoup, Scrapy), corpus-building tools like BootCaT, and R packages like rvest. See the Web Scraping with R tutorial for hands-on guidance.

Key considerations:

  • Respect robots.txt — this file indicates which parts of a website should not be scraped
  • Check terms of service — many websites restrict automated data collection
  • Use rate limiting — do not overload servers with rapid-fire requests
  • Dynamic content — content generated by JavaScript may require browser automation tools
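As a minimal illustration of the extraction step with rvest — here parsing an HTML string so the example is self-contained; for a live site you would pass a URL to read_html() and observe the considerations above:

```r
library(rvest)

# Toy HTML standing in for a downloaded page
html <- '<html><body>
  <nav>site menu</nav>
  <p>First paragraph of the article.</p>
  <p>Second paragraph.</p>
</body></html>'

page  <- read_html(html)
paras <- page |>
  html_elements("p") |>  # keep paragraph elements, drop navigation etc.
  html_text2()           # extract text with normalised whitespace
```

Selecting specific elements (rather than dumping the whole page) is what separates usable corpus text from boilerplate-riddled scrapes.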

Social media collection

Social media platforms provide APIs (Application Programming Interfaces) for structured data access: the Twitter/X API, Reddit API (accessible via the Python package PRAW), and others. APIs provide well-structured data with rich metadata, but come with rate limits, changing access policies, and significant ethical questions. The fact that a post is publicly visible does not automatically mean the author consented to it being used in a research corpus.

Social media API instability

Social media APIs change frequently and without warning. Twitter/X dramatically restricted API access in 2023, making large-scale corpus collection from that platform much harder overnight. Reddit similarly tightened API terms in 2023. Never build a research project that depends entirely on continued access to a social media API. Always have a backup data source, archive data as soon as it is collected, and check current API terms before committing to a collection strategy. The data you can access today may not be accessible next month.

Manual collection

Copying and pasting text from sources, or scanning physical documents and using OCR (optical character recognition), is time-intensive but gives you complete control over selection. Suitable for small, specialised corpora where careful selection is more important than scale.

Elicited data

Having participants produce language specifically for your research — essays, think-aloud protocols, writing tasks — ensures comparability across participants because everyone responds to the same prompt. The cost is that it requires participant recruitment, informed consent, and ethics approval.
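However the texts are collected, the first step towards analysis is usually reading them all into R. A minimal sketch using readr and purrr, demonstrated on a temporary folder (point the path at your own corpus folder):

```r
library(readr)
library(purrr)

# Demo folder with two small files (replace with your corpus folder)
corpus_dir <- file.path(tempdir(), "raw")
dir.create(corpus_dir, showWarnings = FALSE)
write_file("A first text.",  file.path(corpus_dir, "t01.txt"))
write_file("A second text.", file.path(corpus_dir, "t02.txt"))

# Read every .txt file into a named character vector
files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)
texts <- set_names(map_chr(files, read_file), basename(files))
```

Keeping file names as element names means every text stays linked to its metadata row later on.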


Specialised corpus types

Different research domains present specific compilation challenges that go beyond the general principles covered above. This section introduces six corpus types that require particular consideration.

Spoken corpora

Spoken corpus compilation begins with audio or video recording and ends with a transcription. Every step introduces decisions that affect what linguistic phenomena can be studied.

Recording: Obtain written informed consent before recording. Record at the highest quality your equipment allows — poor audio quality makes transcription unreliable and may prevent analysis of prosodic features. For naturalistic data, be aware of the observer’s paradox: speakers behave differently when they know they are being recorded. Techniques to minimise this include lengthy familiarisation sessions, remote recording by participants themselves, and using experienced fieldworkers.

Transcription conventions: Choose a transcription system appropriate to your research goals before you begin:

System | Used for | Key features
CHAT (CHILDES/TalkBank) | Child language, conversation | Speaker turns, overlaps, non-verbal events, error coding (MacWhinney 2000)
GAT2 | Conversation analysis | Prosody, timing, overlap, intonation
HIAT | Spoken corpora (Ehlich and Rehbein) | Two-tier: verbal + non-verbal
Orthographic | Large-scale corpora, ASR | Plain text, no prosodic detail
TEI | Digital humanities, archives | XML-based, highly flexible

If you only need lexical and grammatical information, orthographic transcription is usually sufficient and much faster. If you are studying prosody, rhythm, or interactional features, you need a richer convention — but richer conventions are far more time-consuming.

Transcription time: Budget 6–10 hours of transcription per hour of audio for a trained transcriber working with clear speech. Difficult audio, overlapping speech, or a complex transcription system can push this to 20+ hours per hour.
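This arithmetic is worth doing explicitly at the planning stage; a trivial sketch for a hypothetical five-hour recording project:

```r
# Transcription budget for 5 hours of clear audio at 6-10 hours per hour
audio_hours <- 5
budget <- c(low = audio_hours * 6, high = audio_hours * 10)
budget  # 30-50 hours of transcriber time
```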

Tools: ELAN (elan.mpi.nl) is the standard tool for time-aligned spoken corpus annotation. Transcriber and Praat are also widely used. For very large corpora, forced alignment tools (such as the Montreal Forced Aligner or WebMAUS) can automatically align an existing orthographic transcript to the audio, saving transcription time — but they require reasonably clean audio and work best with standard varieties.

Metadata for spoken corpora: Record speaker demographics (age, gender, first language, education, regional background), relationship between interlocutors (strangers, friends, colleagues), setting (formal/informal), and recording quality rating for each recording.

Web and social media corpora

Web and social media data is abundant and readily available, but several issues require careful attention beyond those already mentioned:

Ethical considerations: The Association of Internet Researchers (AoIR) guidelines distinguish between data that is genuinely public (e.g. press releases, public government documents) and data that is technically public but contextually private (e.g. a post in a support group, a tweet by a pseudonymous user). The legal ability to access data does not determine its ethical appropriateness for research. When in doubt, apply the question: would the author reasonably expect this text to appear in a research publication? If not, consider whether anonymisation, aggregation (reporting patterns without individual examples), or exclusion is appropriate.

Quality and representativeness: Web data is biased towards English, towards younger and more educated users, and towards specific genres and topics. It is not a random sample of language use. Be explicit about what your web corpus does and does not represent.

Deduplication: Web text is heavily duplicated — news articles are syndicated, forum posts are quoted in replies, and content aggregators republish material. Always check for and remove duplicate texts before analysis. Near-duplicate detection tools (e.g. the Python library datasketch) can identify texts that are very similar but not identical.
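Exact duplicates that differ only in case or spacing can be removed with a few lines of stringr; near-duplicates need dedicated tools such as the datasketch library mentioned above. A sketch:

```r
library(stringr)

texts <- c("Breaking news: markets rise.",
           "breaking  news: markets rise.",  # same article, re-spaced
           "An unrelated article.")

# Normalise case and whitespace, then drop exact duplicates
norm    <- texts |> str_to_lower() |> str_squish()
deduped <- texts[!duplicated(norm)]
```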

Temporal metadata: Web text should always be archived with a collection timestamp. A web page may change or disappear after you collect it; without a timestamp, your corpus is not reproducible.
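One lightweight way to do this is to embed the collection timestamp in the filename at save time; a sketch (the paths and text are illustrative):

```r
library(readr)

# Save collected text with a UTC collection timestamp in the filename
page_text <- "Text extracted from a web page."
stamp     <- format(Sys.time(), "%Y-%m-%d_%H%M%S", tz = "UTC")
out_path  <- file.path(tempdir(), paste0("page_", stamp, ".txt"))
write_file(page_text, out_path)
```

The timestamp should also go into the metadata spreadsheet alongside the source URL.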

Learner corpora

A learner corpus is a collection of language produced by second language (L2) learners, typically for the purpose of studying interlanguage — the developing linguistic system of learners at different stages of acquisition.

Elicitation tasks: The most common data sources are written essays (elicited via standardised prompts), oral production tasks (picture descriptions, narratives, role plays), and spontaneous speech. Standardised tasks enable comparability across learners; naturalistically collected data is less controlled but may be more ecologically valid.

Key metadata for learner corpora typically includes:

  • L1 (first language) — essential for all learner corpus studies
  • Proficiency level — ideally assessed independently (e.g. CEFR level from an external test), not self-reported
  • Length of residence in an L2-speaking country
  • Age of onset of L2 learning
  • Primary learning context (formal instruction vs. immersion)
  • Task type and prompt (for written production)
  • Time on task (for timed writing)

Well-known learner corpus resources (Flowerdew 2012):

  • ICLE (International Corpus of Learner English) — written essays by university L2 English learners from 16 L1 backgrounds
  • LINDSEI (Louvain International Database of Spoken English Interlanguage) — spoken counterpart to ICLE
  • EFCamDat — 1.2M scripts from Cambridge English learners, graded by CEFR level
  • PELIC (Pittsburgh English Language Institute Corpus) — longitudinal learner writing data

Error annotation: Many learner corpora include error annotation — marking and categorising grammatical, lexical, and orthographic errors. Error annotation schemes vary; the most widely used is the UCLES/UAM scheme. Error annotation is time-consuming and requires trained annotators and inter-rater reliability checks.

Historical and diachronic corpora

Historical corpus linguistics studies language change over time using texts from past periods. This presents unique compilation challenges.

Source materials: Historical texts exist as manuscripts, early printed books, and digitised archives. Sources include:

  • EEBO (Early English Books Online) — English texts 1475–1700
  • ECCO (Eighteenth Century Collections Online) — texts 1701–1800
  • Project Gutenberg — public-domain literary texts
  • Internet Archive — digitised books, newspapers, and other materials
  • National archives, manuscript collections, church records

OCR and its problems: Most historical corpus data reaches you via OCR (optical character recognition). OCR accuracy for modern printed text is typically 97–99%, but for historical texts with non-standard fonts (e.g. Fraktur, secretary hand, long-s), it can drop to 80–90% — meaning one or two characters in ten may be wrong. Always inspect OCR output for common errors: long-s confused with f, period-space at line breaks creating artificial sentence boundaries, and words hyphenated at line breaks split in two.

Spelling normalisation: Historical English spelling was not standardised until the 18th century. The word the appears as þe, the, ye, ðe and many other forms in medieval texts. For frequency studies, you must decide whether to normalise spelling to modern equivalents (enabling direct comparison) or to preserve original spelling (enabling phonological and orthographic analysis). Document your decision explicitly.
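A simple normalisation pass can be implemented as a lookup table of regex replacements. The table below is a tiny illustration with invented entries; real projects typically use dedicated resources such as VARD for Early Modern English:

```r
library(stringr)

# Hypothetical variant-to-modern mapping (word-boundary anchored)
norm_table <- c("\\bye\\b" = "the",
                "\\bþe\\b" = "the",
                "\\bvp\\b" = "up")

str_replace_all("ye king rose vp", norm_table)  # "the king rose up"
```

If you normalise, keep the original-spelling files as well, so that orthographic analysis remains possible.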

Periodisation: Dividing a diachronic corpus into time periods is a theoretical decision, not merely a practical one. Period labels should be linguistically or historically motivated — not arbitrary. Common approaches include using major historical events as period boundaries, or using statistical change-point detection to identify periods of rapid linguistic change.

Metadata for historical corpora: Text date (exact if known, estimated if not, with confidence interval), text type and genre, scribal or printing context, provenance, and digitisation source.

Multilingual and parallel corpora

A multilingual corpus contains data from multiple languages compiled for comparable study. A parallel corpus contains translations of the same source texts — one source language aligned with one or more target languages.

Comparable vs. parallel: In a comparable corpus, texts are matched on genre, time period, and register but are independently produced — not translations. Parallel corpora contain direct translations. For translation studies and computational NLP (training machine translation systems), parallel corpora are preferred. For cross-linguistic typological or sociolinguistic research, comparable corpora are usually more appropriate.

Alignment: Parallel corpora require sentence-level or paragraph-level alignment — establishing which sentence in the translation corresponds to which sentence in the original. Alignment can be done automatically using tools like bleualign or hunalign, but results should always be spot-checked.

Key resources:

  • Europarl — 21 EU languages, parliamentary proceedings, sentence-aligned (Koehn 2005)
  • OpenSubtitles (opus.nlpl.eu) — 60+ languages, film/TV subtitles, sentence-aligned
  • WikiMatrix — 85 languages, mined from Wikipedia, sentence-aligned
  • CCAligned — web-crawled parallel data for 100+ languages
  • Universal Dependencies — dependency-parsed treebanks for 100+ languages (Nivre et al. 2016)

Metadata for multilingual corpora: Language, variety, country of origin, translation direction (if parallel), translator information (professional, crowdsourced), and date are all important. For spoken multilingual data, speaker language background and code-switching behaviour should be noted.


Converting documents to plain text in R

Corpus texts often arrive as PDFs or Word documents rather than plain text. Converting them programmatically is much faster than manual copy-paste and produces consistent results.

Converting PDFs to plain text

Code
# install.packages("pdftools")
library(pdftools)
library(purrr)

# Get all PDFs in a folder
pdf_files <- list.files(
  path       = "data/raw_pdfs",
  pattern    = "\\.pdf$",
  full.names = TRUE
)

# Convert each PDF to plain text
pdf_to_txt <- function(pdf_path) {
  # pdf_text() returns one string per page; collapse to single document
  pages <- pdftools::pdf_text(pdf_path)
  text  <- paste(pages, collapse = "\n\n")

  # Write to a .txt file in the output folder
  out_path <- file.path(
    "data/raw",
    paste0(tools::file_path_sans_ext(basename(pdf_path)), ".txt")
  )
  writeLines(text, out_path, useBytes = FALSE)
  message("Converted: ", basename(pdf_path))
  return(out_path)
}

dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
converted <- map_chr(pdf_files, pdf_to_txt)
message("Converted ", length(converted), " PDFs to plain text.")

OCR vs. text-layer PDFs

pdftools::pdf_text() extracts the text layer from a PDF — this works perfectly for digitally born PDFs (created by Word, LaTeX, or similar). For scanned PDFs (images of printed pages), there is no text layer, and pdf_text() will return an empty string. Scanned PDFs require OCR. In R, the tesseract package provides OCR capability:

# install.packages("tesseract")
library(tesseract)
library(pdftools)
# tesseract::ocr() works on images, so render the PDF pages to PNG first
pages <- pdftools::pdf_convert("scanned_document.pdf", dpi = 300)
text  <- tesseract::ocr(pages)  # one string per page

OCR accuracy depends heavily on scan quality, font, and language. Always inspect OCR output before proceeding.

Converting Word documents to plain text

Code
# install.packages("officer")
library(officer)
library(dplyr)
library(purrr)

# Convert a single .docx file to plain text
docx_to_txt <- function(docx_path) {
  doc    <- officer::read_docx(docx_path)
  # Extract text content as a data frame, one paragraph per row
  content <- officer::docx_summary(doc)
  # Keep only paragraph text (not table cells, headers, etc. — adjust as needed)
  text_content <- content |>
    dplyr::filter(content_type == "paragraph") |>
    dplyr::pull(text) |>
    paste(collapse = "\n")

  out_path <- file.path(
    "data/raw",
    paste0(tools::file_path_sans_ext(basename(docx_path)), ".txt")
  )
  writeLines(text_content, out_path)
  message("Converted: ", basename(docx_path))
  return(out_path)
}

docx_files <- list.files("data/raw_docx", pattern = "\\.docx$",
                          full.names = TRUE)
converted  <- map_chr(docx_files, docx_to_txt)

Detecting and fixing encoding problems in R

Encoding errors — where characters appear as garbled symbols — are among the most common and most frustrating problems in corpus work. The root cause is almost always a mismatch between the encoding used to write the file and the encoding used to read it.

Code
library(readr)
library(stringr)
library(purrr)

# Step 1: Detect the encoding of a suspicious file
suspect_file <- "data/raw/problem_text.txt"
readr::guess_encoding(suspect_file)
# Returns a tibble of likely encodings with confidence scores.
# Common results: UTF-8, ISO-8859-1 (Latin-1), windows-1252

# Step 2: Read with the detected encoding
text_raw <- readr::read_file(
  suspect_file,
  locale = readr::locale(encoding = "windows-1252")
)

# Step 3: Re-save as UTF-8
readr::write_file(text_raw, "data/raw/problem_text_utf8.txt")

# Step 4: Batch-fix all files with a known wrong encoding
fix_encoding <- function(file_path,
                          from_encoding = "windows-1252",
                          out_dir       = "data/raw") {
  text_raw <- readr::read_file(
    file_path,
    locale = readr::locale(encoding = from_encoding)
  )
  out_path <- file.path(out_dir, basename(file_path))
  readr::write_file(text_raw, out_path)
  message("Re-encoded: ", basename(file_path))
}

# Apply to all files in a folder that you suspect have wrong encoding
bad_files <- list.files("data/suspect", pattern = "\\.txt$", full.names = TRUE)
walk(bad_files, fix_encoding, from_encoding = "ISO-8859-1")

# Step 5: Quick visual check for common encoding artefacts
# If your text contains strings like â€™ or Ã©, it is UTF-8 data
# that was read as Windows-1252. Check with:
text_sample <- readr::read_file("data/raw/text_001.txt")
if (str_detect(text_sample, "â€|Ã")) {
  warning("Possible encoding error detected in text_001.txt")
}

Splitting a large text file into individual documents

Some corpus sources deliver all texts in a single large file with document delimiters. This code splits such a file into individual per-document files:

Code
library(stringr)
library(readr)
library(purrr)

# Example: a single file where each document starts with a marker like
# <text id="001"> ... </text>
# Adapt the pattern to match your actual delimiter

combined_file <- "data/raw/all_texts.txt"
raw_content   <- readr::read_file(combined_file)

# Split on document start marker (adjust regex to match your delimiter)
# This example assumes XML-style <text id="NNN"> markers
docs <- str_split(raw_content, "(?=<text id=)")[[1]]
docs <- docs[str_length(docs) > 10]  # discard empty splits

# For each document, extract the ID and write to a separate file
dir.create("data/raw/split", recursive = TRUE, showWarnings = FALSE)

walk(docs, function(doc) {
  # Extract document ID from the opening tag
  doc_id <- str_extract(doc, '(?<=id=")[^"]+')
  if (is.na(doc_id)) {
    doc_id <- paste0("doc_", format(Sys.time(), "%H%M%OS3"))
  }

  # Remove the XML wrapper tags if not needed
  text_only <- str_remove_all(doc, "</?text[^>]*>") |> str_trim()

  out_path <- file.path("data/raw/split", paste0(doc_id, ".txt"))
  writeLines(text_only, out_path)
})

message("Split into ", length(docs), " individual files.")

Reading text files into R

Once you have collected your text files, the first practical step is reading them into R. The following code reads all .txt files from a corpus folder and stores them in a data frame:

Code
library(dplyr)
library(readr)
library(purrr)
library(stringr)

# Path to your corpus folder
corpus_dir <- "data/corpus_texts"

# Get all .txt file paths
txt_files <- list.files(
  path       = corpus_dir,
  pattern    = "\\.txt$",
  full.names = TRUE
)

# Read each file and store as a data frame
corpus_df <- map_dfr(txt_files, function(f) {
  tibble(
    filename = basename(f),
    text     = read_file(f)   # read_file() reads the whole file as one string
  )
})

# Inspect
glimpse(corpus_df)
cat("Texts loaded:", nrow(corpus_df), "\n")
cat("Total characters:", sum(nchar(corpus_df$text)), "\n")

Using readLines() vs read_file()

readr::read_file() reads a whole file as a single character string — useful when you want to treat each file as one document. readLines() (base R) reads a file as a vector of lines — useful when you need line-by-line processing. For most corpus work, read_file() is more convenient.
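The difference is easy to see on a small demo file:

```r
library(readr)

# A small demo file
tmp <- tempfile(fileext = ".txt")
writeLines(c("First line.", "Second line."), tmp)

as_one_string <- read_file(tmp)   # one string; newlines preserved
as_lines      <- readLines(tmp)   # character vector, one element per line

length(as_lines)                          # 2
grepl("\n", as_one_string, fixed = TRUE)  # TRUE
```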


Text Cleaning

Section Overview

What you will learn: Why text cleaning is necessary; what to remove and what to preserve; manual, semi-automated, and automated cleaning approaches; file formatting requirements for corpus tools; and how to implement systematic cleaning in R using stringr

Why clean?

Corpus tools expect clean, consistent formatting. Extraneous material — page numbers, HTML tags, navigation menus, copyright notices — will appear in your frequency counts and concordance lines, distorting your results. Standardisation enables accurate frequency counts and pattern searches. And cleaning reduces “noise”, making patterns clearer and analysis more efficient.

What to remove and what to preserve

Remove or clean:

  • Navigation elements from websites (menu text, breadcrumbs, sidebar content)
  • Boilerplate text (disclaimers, copyright notices, standard headers and footers)
  • Duplicate content (the same text appearing more than once)
  • Formatting codes (HTML tags, XML markup, PDF artefacts)
  • Encoding errors (garbled characters from wrong character encoding)

Preserve:

  • The actual language data you are studying
  • Sentence boundaries (crucial for many types of analysis — do not remove full stops)
  • Paragraph structure (if relevant to your research)
  • Punctuation (unless you have a specific reason to remove it)
  • Discourse markers and hedging devices (if those are what you are studying!)

Cleaning approaches

Manual cleaning uses text editors (Notepad++, Sublime Text, VS Code) with find-and-replace. Suitable for small corpora of fewer than 50 files. Time-intensive but gives you complete control.

Semi-automated cleaning uses regular expressions (regex) — powerful pattern-matching tools. For example:

Pattern                                              Removes
<.*?>                                                HTML tags
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b   Email addresses
https?://\S+                                         URLs
\s{2,}                                               Multiple consecutive spaces
^\s*\n                                               Blank lines

Automated cleaning uses pre-built libraries: BeautifulSoup in Python for HTML, ftfy for encoding repair, and in R the stringr package and the tm package for text mining tasks.

Always check automated cleaning results

Automated cleaning can sometimes remove too much or miss problems. Always inspect a sample of cleaned texts by comparing them against the original. Look for both over-cleaning (important language data removed) and under-cleaning (problems that remain). Document your cleaning procedures so you can reproduce them and explain them to others.
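One check that is easy to automate is comparing text length before and after cleaning, flagging files that lost a suspicious share of their characters; a sketch with demo data standing in for a real corpus data frame:

```r
library(dplyr)

# Demo data standing in for a corpus data frame with raw and cleaned text
corpus_df <- tibble(
  filename     = c("a.txt", "b.txt"),
  text_raw     = c("Hello <b>world</b>, this is fine.", "<div><p>x</p></div>"),
  text_cleaned = c("Hello world, this is fine.", "x")
)

# Flag texts that lost more than half their characters during cleaning;
# the 50% threshold is a judgement call, tune it to your corpus
flagged <- corpus_df |>
  mutate(pct_removed = 100 * (nchar(text_raw) - nchar(text_cleaned)) /
                       nchar(text_raw)) |>
  filter(pct_removed > 50)

flagged$filename   # "b.txt": most of its content was markup
```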

Text cleaning in R

The stringr package provides a consistent, readable interface for string manipulation. Here is a systematic cleaning pipeline:

Code
library(stringr)
library(dplyr)

clean_text <- function(text) {
  text |>
    # Remove HTML tags
    str_remove_all("<[^>]+>") |>
    # Remove URLs
    str_remove_all("https?://\\S+") |>
    # Remove email addresses
    str_remove_all("\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b") |>
    # Standardise line endings (Windows CRLF → Unix LF)
    str_replace_all("\r\n", "\n") |>
    # Remove excessive blank lines (3+ consecutive newlines → 2)
    str_replace_all("\n{3,}", "\n\n") |>
    # Collapse multiple spaces to one
    str_replace_all("[ \t]{2,}", " ") |>
    # Remove leading/trailing whitespace per line
    str_replace_all("(?m)^[ \t]+|[ \t]+$", "") |>
    # Final trim
    str_trim()
}

# Apply to the corpus data frame
corpus_df <- corpus_df |>
  mutate(
    text_raw     = text,           # keep original
    text_cleaned = clean_text(text)
  )

# Sanity check: compare a sample
cat("=== ORIGINAL (first 300 chars) ===\n")
cat(substr(corpus_df$text_raw[1], 1, 300), "\n\n")
cat("=== CLEANED (first 300 chars) ===\n")
cat(substr(corpus_df$text_cleaned[1], 1, 300), "\n")

File format requirements

Most corpus tools (AntConc, Sketch Engine, R corpus packages) expect:

  • Plain text files (.txt) — not Word documents or PDFs; convert those first
  • UTF-8 encoding — the universal standard, supporting all characters across all languages and scripts
  • One text per file (preferred) — or multiple texts with clear delimiters
  • Consistent line endings — Unix-style (LF), not Windows-style (CRLF)

You can check and convert encoding in Notepad++ (via the Encoding menu) or with the command-line tool iconv.
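Base R also ships an iconv() function, so the same conversion can be done without leaving R; a minimal sketch:

```r
# Build a Latin-1 string from raw bytes: "café" in ISO-8859-1
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
latin1_text  <- rawToChar(latin1_bytes)
Encoding(latin1_text) <- "latin1"

# Convert to UTF-8 with base R's iconv()
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")
utf8_text   # "café"
```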

File naming conventions

File names are more important than they might seem. A systematic naming convention encodes metadata directly in the filename, enabling sorting and filtering without even opening the files.

A recommended format: genre_year_speakerID_textID.txt

For example:

  • blog_2023_F28_001.txt — a blog post from 2023, female author aged 28, text number 001
  • news_2024_NA_047.txt — a newspaper article from 2024, author age not available, text 047
  • essay_2022_M22_L2_003.txt — an essay from 2022, male author aged 22, L2 writer, text 003
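Whether existing filenames conform to such a convention can be checked with a regular expression; a sketch for the genre_year_speakerID_textID format (the last filename deliberately violates it):

```r
library(stringr)

# Pattern: genre_year_authorID_textID.txt, with optional extra segments
# (e.g. the L2 marker in essay_2022_M22_L2_003.txt)
name_pattern <- "^[a-z]+_\\d{4}_([FMO]\\d{1,3}|NA)(_[A-Za-z0-9]+)*_\\d{3}\\.txt$"

files <- c("blog_2023_F28_001.txt",
           "news_2024_NA_047.txt",
           "essay_2022_M22_L2_003.txt",
           "My Essay (final).txt")          # violates the convention

files[!str_detect(files, name_pattern)]     # "My Essay (final).txt"
```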

Exercises: Text Cleaning

Q4. A researcher is cleaning a corpus of forum posts downloaded as HTML. Her cleaning script removes all HTML tags using the regex pattern <.*?>. When she inspects a sample of cleaned texts, she finds that some forum posts now contain garbled runs of text like “amp; nbsp; gt;” scattered through them. What is the problem and how should she fix it?






Corpus Folder Structure: The Standard Layout

Section Overview

What you will learn: Why a consistent, documented folder structure matters for corpus sharing and reproducibility; the standard top-level components of a shareable corpus; and what each component should contain

A corpus is not just a folder full of text files. When a corpus is made available to other researchers — whether through a repository, a university data archive, or a direct request — it needs to be organised and documented so that anyone receiving it can understand what it contains, how it was compiled, and how to use it without needing to contact the compiler. Even if you never intend to share your corpus publicly, organising it this way protects you: it means that six months from now, when you return to the data, everything you need is in one place and clearly labelled.

The standard corpus folder layout

A well-organised shareable corpus follows a predictable structure that researchers have converged on across the field. The folder is named after the corpus using a short, recognisable abbreviation (all capitals is conventional):

LADALC/                          ← corpus root folder, named after the corpus
│
├── README.md                    ← who compiled it, what it contains, how to use it
├── LICENSE.txt                  ← usage rights and restrictions
├── metadata.csv                 ← one row per file; links files to speaker/text information
│
└── data/                        ← all corpus text files
    ├── text_001.txt
    ├── text_002.txt
    └── ...

This is the minimal structure. Every shareable corpus should have at least these four components. Here is what each one does:

The README file

The README is the first thing a new user reads. It should answer every question they might have before they open a single data file. A README written at the time of compilation is infinitely more accurate than one reconstructed from memory later.

Keep it in plain text (.txt) or Markdown (.md) so it is readable without special software. A minimal README should cover:

# CORPUS NAME (ABBREVIATION)

## Overview
Brief description of the corpus: what language variety, genre(s), 
time period, and purpose.

## Compilers
Name(s), affiliation(s), contact email, year of compilation.

## Contents
- Total number of files
- Total word count (approximate)
- File format (e.g. plain text UTF-8)
- Languages included

## Corpus Design
How texts were selected; sampling strategy; inclusion/exclusion criteria.

## Data Collection
Sources; collection methods; date(s) of collection.

## Annotation
What annotation (if any) is present; tagset used; annotation software.
If unannotated, state this explicitly.

## Ethical and Legal Status
Ethics approval reference (if applicable); consent procedures;
anonymisation procedures; copyright status of source material.

## How to Cite This Corpus
Full citation in a standard format (APA, MLA, or a corpus-specific format).

## Version History
Version number; date; description of changes.

Write the README as you compile, not after

The README is most accurate — and easiest to write — while you are actively making the decisions it describes. A README written six months after compilation is almost always incomplete. Treat the README as a living document: start it on day one and update it every time you make a significant decision about corpus design or processing.

The LICENSE file

The LICENSE tells users what they are and are not allowed to do with the corpus. Without an explicit license, users cannot legally redistribute, modify, or even be certain they can use the corpus for their own research. Common choices for research corpora:

  • CC BY 4.0 — allows free use, distribution, and modification with attribution. Typical use: open research corpora with no restrictions on content.
  • CC BY-NC 4.0 — allows free use with attribution, but no commercial use. Typical use: academic corpora you want to keep out of commercial products.
  • CC BY-NC-ND 4.0 — requires attribution; no commercial use; no derivatives. Typical use: corpora where you need to control exactly how the data is used.
  • Custom/restricted — terms defined by your institution or ethics approval. Typical use: corpora with sensitive data or third-party copyright material.

If your corpus contains data from participants who gave consent for specific uses only, your license must be consistent with those consent conditions. If it contains published texts, copyright in those texts may restrict redistribution regardless of what license you apply to your own compilation work.

The metadata file

The metadata file (typically metadata.csv) contains one row per corpus file and one column per metadata variable, with the filename as the linking key. This is covered in detail in the Organising Metadata section below.

The data folder

The data/ folder contains all corpus text files, named according to a consistent convention (see File naming conventions above). For a simple, unannotated corpus, this is a flat folder of .txt files. For more complex corpora, the data folder may contain subfolders — see the next section.


Corpus Folder Structure: Variations and Advanced Layouts

Section Overview

What you will learn: How corpus folder structures scale up as a corpus grows in complexity; layouts for corpora with multiple annotation layers, multiple genres, or multiple time periods; and when to use flat versus nested organisation

When you need subfolders in the data folder

A flat data/ folder is appropriate for simple, single-layer corpora. As soon as a corpus has more than one version of the data — for example, raw text alongside POS-tagged text — or more than one clearly distinct subcorpus, subfolders are needed.

Layout: raw and annotated data

The most common reason for subfolders is having both unannotated raw text and one or more annotated versions:

LADALC/
│
├── README.md
├── LICENSE.txt
├── metadata.csv
│
└── data/
    ├── raw/                     ← plain text files, no annotation
    │   ├── text_001.txt
    │   └── text_002.txt
    │
    └── annotated/               ← POS-tagged or otherwise annotated files
        ├── text_001_tagged.txt
        └── text_002_tagged.txt

The raw files are the source of truth; the annotated files are derived from them. This separation makes it easy to re-annotate if a better tagger becomes available, or to provide the corpus to users who only want the plain text.

Layout: multiple annotation layers

For corpora with multiple annotation types (POS tagging, dependency parsing, named entity recognition, sentiment scores), each annotation layer gets its own subfolder:

LADALC/
│
├── README.md
├── LICENSE.txt
├── metadata.csv
│
└── data/
    ├── raw/
    ├── pos_tagged/              ← part-of-speech tagged (e.g. CLAWS, TreeTagger)
    ├── parsed/                  ← dependency-parsed (e.g. Stanford, spaCy)
    └── ner/                     ← named-entity recognised

Each subfolder should be documented in the README: which tool was used, what tagset, what version of the software, and when annotation was performed (Garside, Leech, and McEnery 1997).

Layout: multiple subcorpora or genres

If a corpus contains clearly distinct sub-collections — different genres, different time periods, different speaker groups — these can be organised as subfolders within data/:

LADALC/
│
├── README.md
├── LICENSE.txt
├── metadata.csv                 ← single metadata file covering all subcorpora
│
└── data/
    ├── blogs/
    │   ├── blog_2022_F28_001.txt
    │   └── ...
    ├── newspapers/
    │   ├── news_2022_001.txt
    │   └── ...
    └── academic/
        ├── acad_2022_001.txt
        └── ...

Keep the metadata file unified

Even when the data folder contains subfolders, the metadata spreadsheet should remain a single file at the corpus root level. A fragmented metadata system — one spreadsheet per subfolder — makes cross-subcorpus analysis much harder and introduces consistency risks. The genre column in the metadata identifies which subcorpus a file belongs to; no need to encode it structurally.
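Even with subfolders, a single metadata file is easy to build: list.files(recursive = TRUE) returns one relative path per file, and the subfolder name supplies the genre column. A minimal sketch using a temporary demo tree:

```r
library(dplyr)

# Build a miniature data/ tree for demonstration
root <- file.path(tempdir(), "data")
for (g in c("blogs", "newspapers")) {
  dir.create(file.path(root, g), recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(root, "blogs", "blog_2022_F28_001.txt"),
            file.path(root, "newspapers", "news_2022_001.txt"))

# One metadata row per file; genre derived from the subfolder name
rel_paths <- list.files(root, pattern = "\\.txt$", recursive = TRUE)
metadata_df <- tibble(
  filename = basename(rel_paths),
  genre    = dirname(rel_paths)   # "blogs", "newspapers"
)
metadata_df
```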

Layout: diachronic or versioned corpora

For corpora that are extended over time or released in successive versions, a time-period or version-based organisation is appropriate:

LADALC/
│
├── README.md
├── LICENSE.txt
├── metadata.csv
│
└── data/
    ├── period1_1788-1825/
    ├── period2_1826-1850/
    ├── period3_1851-1875/
    └── period4_1876-1900/

Or for versioned releases:

LADALC/
├── README.md                    ← always describes the current version
├── CHANGELOG.md                 ← version history with dates and changes
├── LICENSE.txt
├── metadata.csv
└── data/

Layout: corpora with accompanying scripts

If you are sharing not just the corpus but also the R or Python scripts used to compile and analyse it — which is excellent practice for reproducibility — add a scripts/ folder alongside data/:

LADALC/
│
├── README.md
├── LICENSE.txt
├── metadata.csv
│
├── data/
│   └── ...
│
└── scripts/
    ├── 01_collect.R             ← data collection script
    ├── 02_clean.R               ← text cleaning script
    ├── 03_annotate.R            ← annotation script
    └── 04_analyse.R             ← analysis script

Numbering the scripts (01_, 02_, etc.) makes the intended execution order immediately clear. This layout, combined with a good README, gives other researchers everything they need to reproduce your entire workflow from raw data to published findings.
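Because the numeric prefixes sort correctly, the whole pipeline can be re-run with a few lines of R; the sketch below demonstrates the pattern on two tiny placeholder scripts in a temporary folder:

```r
# Demo: numbered scripts in a temporary folder run in sorted order
scripts_dir <- file.path(tempdir(), "scripts")
dir.create(scripts_dir, showWarnings = FALSE)
writeLines('ran <- c(ran, "collect")', file.path(scripts_dir, "01_collect.R"))
writeLines('ran <- c(ran, "clean")',   file.path(scripts_dir, "02_clean.R"))

ran <- character(0)
for (s in sort(list.files(scripts_dir, pattern = "^\\d{2}_.*\\.R$",
                          full.names = TRUE))) {
  source(s, local = TRUE)   # local = TRUE runs each script in the calling environment
}
ran   # "collect" "clean"
```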


Corpus Folder Structure: Best Practices

Section Overview

What you will learn: Practical rules for keeping a corpus folder clean, consistent, and shareable; version control and backup strategies; and the most common organisational mistakes to avoid

Naming conventions for folders and files

The same principles that apply to file naming (see the Text Cleaning section) apply to folder names:

  • Use lowercase and underscores — pos_tagged/, not POS Tagged/ or POS-Tagged/. Spaces in folder names cause problems in the command line and in R paths.
  • Be descriptive but concise — raw/ and annotated/ are better than v1/ and v2/ because they tell you what the contents are, not just when they were created.
  • Never use special characters — no &, (, ), #, or % in folder or file names. These cause problems in URLs, command-line tools, and many corpus software packages.
  • Date-stamp output folders if you generate multiple versions — output_2026-03-15/ is much more informative than output_new/ or output_final/.
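These rules can be enforced with a quick pattern check (the names below are illustrative):

```r
# Flag file or folder names containing spaces, capitals, or special characters
names_to_check <- c("pos_tagged", "POS Tagged", "output_final", "notes#2")
unsafe <- grepl("[^a-z0-9._/-]", names_to_check)
names_to_check[unsafe]   # "POS Tagged" "notes#2"
```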

Separate raw data from derived data

The most important structural principle is: never overwrite or modify your raw data files. Once a text has been collected and placed in data/raw/, it should never change. All cleaning, annotation, and processing creates new files in new folders. This means you can always return to the original source if you need to re-process with different settings.

Raw data is sacred

If you overwrite your raw files during cleaning, you can never recover what the original data looked like. Always keep raw and processed data in separate folders. If storage space is a concern, compress the raw folder as a .zip archive, but never delete it.
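Archiving the raw folder can be scripted too. The text above suggests a .zip archive; the sketch below uses base R's tar() instead, which needs no external zip program. It runs on a temporary demo tree; for a real project you would point it at your actual data/raw/ folder:

```r
# Demo in a temporary directory
raw_dir <- file.path(tempdir(), "data", "raw")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("some corpus text", file.path(raw_dir, "text_001.txt"))

archive <- file.path(tempdir(), "data_raw_backup.tar.gz")
old_wd  <- setwd(tempdir())   # archive with paths relative to the project root
utils::tar(archive, files = "data/raw",
           compression = "gzip", tar = "internal")
setwd(old_wd)
file.exists(archive)   # TRUE
```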

Version control and backup

Local backups: Keep at least two copies of your corpus on separate physical devices. A backup on the same laptop as the original is not a backup — it disappears if the laptop is lost or stolen.

Cloud backup: Use university-provided cloud storage (OneDrive, SharePoint, Google Drive via your institution) for automatic synchronisation. Be aware of data governance requirements: sensitive or ethics-approved data may not be permitted on commercial cloud services.

Version control with Git: For corpora that change over time, Git (via GitHub or your institution’s GitLab) provides a complete history of every change. This is particularly valuable for corpora compiled by a team. The CHANGELOG.md file in the corpus root should record major version changes even if you do not use Git.

Archive for long-term preservation: For corpora you intend to publish, deposit them in a persistent repository such as CLARIN, Zenodo, PARADISEC (for Australian language data), or your institutional repository. These services provide a stable URL (DOI) you can cite in publications and guarantee long-term access.

Keep a research log

Alongside the corpus folder, maintain a research log — a dated plain-text or Markdown file recording every significant decision you make during compilation:

RESEARCH_LOG.md

2026-03-01  Started collecting blog posts from [source].
            Decided to include posts > 200 words only.
            Reason: shorter posts insufficient context for collocation analysis.

2026-03-08  Discovered encoding problem in 12 files from [source].
            Fixed with iconv -f windows-1252 -t utf-8.
            Affected files documented in metadata.csv, column 'encoding_notes'.

2026-03-15  Removed 8 files: duplicates of texts already in corpus.
            File IDs logged below.

This log is not the README (which describes the finished corpus) but a working document recording the messy reality of compilation. It is invaluable when you need to justify methodological decisions in a paper, respond to reviewer queries, or hand the project to a collaborator.

What to check before sharing a corpus

Before making a corpus available to others — whether by deposit in a repository, upload to a website, or transfer to a collaborator — work through this checklist:

Pre-sharing checklist

  • README.md is present, up to date, and describes the current version
  • LICENSE.txt is present and consistent with consent conditions and any third-party copyright
  • metadata.csv has one row for every file, and every file has a metadata row (run the validation check in the Organising Metadata section)
  • All files are plain text, UTF-8 encoded, with consistent Unix line endings
  • File and folder names follow the documented naming convention
  • All identifying information has been anonymised in line with ethics approval
  • Raw data is preserved separately from derived data
Exercises: Corpus Folder Structure

Q11. A researcher deposits her corpus in a university repository. Six months later, a colleague downloads it and finds three folders: data_final/, data_final_v2/, and data_FINAL_USE_THIS/. There is no README. What two fundamental best practices does this violate, and what should the corpus look like instead?






Organising Metadata

Section Overview

What is metadata and why does it matter?

Metadata is information about each text in your corpus — not the language data itself, but contextual information that describes it. Metadata is essential for:

  • Filtering — creating subcorpora for specific analysis (e.g. “only blog posts from 2022”)
  • Comparing — testing whether linguistic patterns differ across groups (e.g. by genre, author gender, or time period)
  • Contextualising — understanding what factors might be driving the patterns you observe
  • Replicating — enabling other researchers to understand exactly what your corpus contains

Metadata is often treated as an afterthought, but this is a critical mistake. If you do not record metadata at the time of collection, much of it cannot be recovered later.

Types of metadata

Bibliographic metadata provides information about authorship and publication:

  • Author or speaker information: demographics (age, gender, first language, education level)
  • Title, publication date, source
  • Genre or text type classification
  • Geographic origin of the text or speaker

Technical metadata documents the digital characteristics and processing history:

  • Filename and file size
  • Collection date and method
  • Word count and character count (essential for normalising frequency data)
  • Processing notes: has this text been cleaned? anonymised? converted from PDF?

Contextual metadata (especially important for spoken or interactive data):

  • Setting: location, formality level
  • Participants: their relationships and roles in the interaction
  • Purpose or topic of the interaction
  • Recording quality (affects what phenomena can be reliably transcribed and analysed)

Structuring a metadata spreadsheet

The most practical format is a spreadsheet — Excel, Google Sheets, or CSV. The key structural rules are:

  • One row per text
  • One column per metadata variable
  • First row contains column headers with clear, consistent variable names
  • One column must be a unique text ID that links to the filenames

Here is what a well-structured metadata spreadsheet looks like:

id  text_id  filename               genre      year  author_gender  author_age  word_count
1   1        blog_2023_F28_001.txt  blog       2023  F              28          1250
2   2        blog_2023_M35_002.txt  blog       2023  M              35          980
3   3        news_2023_001.txt      newspaper  2023  NA             NA          2100

Metadata best practices

  • Use consistent codes throughout — choose either F/M or Female/Male for gender, but do not mix the two in the same spreadsheet
  • Avoid special characters in codes — they can cause problems in statistical software
  • Handle missing data consistently — use NA or leave blank, but choose one convention and stick to it
  • Create a codebook — a separate document that explains every variable, its description, and its possible values
  • Include version control information — when was this spreadsheet created? what version is it?

Example codebook entry:

Variable: author_gender
Description: Self-reported gender of the text's author
Values:
  F  = female
  M  = male
  O  = other/non-binary
  NA = not available or not applicable
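A quick frequency table per categorical column catches mixed codes like those the codebook is meant to prevent before they reach the analysis stage; a sketch with hypothetical data:

```r
library(dplyr)

# Hypothetical metadata with an inconsistent gender column
metadata_df <- tibble(
  filename      = c("t1.txt", "t2.txt", "t3.txt", "t4.txt"),
  author_gender = c("F", "M", "Female", "F")   # "Female" breaks the F/M/O scheme
)

# Frequency table: any value outside the codebook is a red flag
metadata_df |> count(author_gender)

allowed  <- c("F", "M", "O", NA)
bad_rows <- metadata_df |> filter(!author_gender %in% allowed)
bad_rows$filename   # "t3.txt"
```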

Building a metadata spreadsheet in R

Code
library(dplyr)
library(stringr)
library(readr)

# Assume corpus_df already contains filename and text_cleaned columns
# Parse metadata from filenames (format: genre_year_authorID_textID.txt)

metadata_df <- corpus_df |>
  mutate(
    # Remove file extension
    name_noext = str_remove(filename, "\\.txt$"),
    # Split filename into components
    genre         = str_extract(name_noext, "^[^_]+"),
    year          = as.integer(str_extract(name_noext, "(?<=_)\\d{4}(?=_)")),
    author_id     = str_extract(name_noext, "(?<=\\d{4}_)[^_]+"),
    text_id       = str_extract(name_noext, "[^_]+$"),
    # Derive author gender and age from author_id (format: F28 or M35 or NA)
    author_gender = str_extract(author_id, "^[FMO]"),
    author_age    = as.integer(str_extract(author_id, "\\d+")),
    # Compute word count from cleaned text
    word_count    = str_count(text_cleaned, "\\S+")
  ) |>
  select(filename, genre, year, author_gender, author_age,
         text_id, word_count)

# Inspect
glimpse(metadata_df)

# Save metadata spreadsheet
write_csv(metadata_df, "data/corpus_metadata.csv")
message("Metadata saved to data/corpus_metadata.csv")

Linking metadata to analysis results in R

The power of a well-structured metadata spreadsheet becomes clear when you combine it with analysis results. For example, after computing word frequencies in your corpus, you can merge those results with metadata to compare frequency patterns across groups:

Code
# Example: load analysis results and merge with metadata
analysis_results <- read_csv("data/frequency_results.csv")

# Join by filename (the linking variable)
results_with_metadata <- analysis_results |>
  left_join(metadata_df, by = "filename")

# Now you can filter by any metadata variable
blog_results <- results_with_metadata |>
  filter(genre == "blog")

female_results <- results_with_metadata |>
  filter(author_gender == "F")

Validating metadata against corpus files

One of the most common and consequential errors in corpus work is a mismatch between the filenames listed in the metadata spreadsheet and the actual files in the corpus folder. This code performs the validation:

Code
library(dplyr)
library(readr)

# Load metadata
metadata_df <- read_csv("data/corpus_metadata.csv", show_col_types = FALSE)

# Get actual files in the corpus folder
actual_files <- tibble(
  filename = basename(list.files("data/raw", pattern = "\\.txt$",
                                  full.names = TRUE))
)

# Files in metadata but NOT on disk (orphan metadata entries)
orphan_metadata <- anti_join(metadata_df, actual_files, by = "filename")
if (nrow(orphan_metadata) > 0) {
  warning(nrow(orphan_metadata),
          " files listed in metadata have no corresponding file on disk:")
  print(orphan_metadata$filename)
}

# Files on disk but NOT in metadata (undocumented files)
undocumented <- anti_join(actual_files, metadata_df, by = "filename")
if (nrow(undocumented) > 0) {
  warning(nrow(undocumented),
          " files on disk have no metadata entry:")
  print(undocumented$filename)
}

if (nrow(orphan_metadata) == 0 && nrow(undocumented) == 0) {
  message("Validation passed: all files have metadata and all metadata has files.")
}

Run this check every time you add files to the corpus or update the metadata spreadsheet. The anti_join() pattern — files in A but not B, then files in B but not A — catches both types of mismatch.

Exercises: Metadata

Q5. A researcher builds a corpus of 200 student essays and records metadata in a spreadsheet. Three months later, when she comes to analyse the data, she finds that the ‘proficiency_level’ column contains a mixture of codes: some cells say ‘beginner’, some say ‘Beginner’, some say ‘beg’, and some say ‘1’. What problem does this create, and what should she have done to prevent it?






Ethics and Legal Frameworks

Section Overview

What you will learn: The key ethical frameworks relevant to corpus linguistics — GDPR, the Australian Privacy Act, and institutional ethics requirements; what informed consent must cover for language research; anonymisation strategies; copyright and fair dealing; and practical steps for different types of corpus data

Ethical and legal compliance is not a bureaucratic hurdle to clear before getting to the “real” research — it is a fundamental obligation to research participants, to the public, and to the integrity of the discipline. This section covers the main frameworks you are likely to encounter in Australia, the UK, and the EU, followed by practical guidance for common corpus scenarios.

Key regulatory frameworks

In Australia, research involving human participants is governed by the National Statement on Ethical Conduct in Human Research (NHMRC, 2007/2018). Institutional ethics approval is required for research involving human participants, with expedited or low-risk pathways available for research that poses minimal risk (e.g. analysis of publicly available texts). The Privacy Act 1988 and the Australian Privacy Principles govern the collection, use, and storage of personal information. Research corpora containing identified or identifiable participant data must comply with these principles.

In the EU and UK, the General Data Protection Regulation (GDPR) (or UK GDPR post-Brexit) governs any processing of personal data of EU/UK residents, regardless of where the researcher is located. Key points for corpus researchers:

  • Personal data includes anything that could identify a person — names, voices, writing styles, combinations of demographic characteristics
  • Language data from identified individuals is personal data
  • Processing personal data for research requires a legal basis — typically scientific research under Article 89, which allows broader use than commercial processing but still requires appropriate safeguards
  • You must conduct a Data Protection Impact Assessment (DPIA) for high-risk processing
  • Data minimisation: collect only what you need; anonymise as soon as possible
  • Data subjects have rights including access and erasure — think about how you will handle these requests

Anonymisation strategies for corpus linguistics:

  • Pseudonymisation: replace real names with codes (e.g. Speaker A, P1). Use for most spoken and interview corpora.
  • Removal: delete identifying information entirely. Use when even pseudonyms could reveal identity.
  • Generalisation: replace specific details with ranges (e.g. “in her 30s” instead of “34”). Use for demographic details in metadata.
  • Paraphrase: rewrite identifying content in indirect speech. Use when a direct quote would identify a participant.
  • Aggregation: report patterns without individual examples. Use when any example could identify the source.

Voice recordings require extra care

Even after pseudonymisation of names in a transcript, an audio recording can still identify a speaker. For corpora where the audio is distributed alongside transcripts, consider whether voice disguise (pitch shifting) is needed, or whether audio distribution is appropriate at all. Some sensitive corpora distribute only the transcript, not the audio.
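Pseudonymisation of known participant names can be partly automated with stringr. A minimal sketch — the names and speaker codes below are invented for illustration, and a manual spot-check of the output is still essential:

```r
library(stringr)

# Hypothetical mapping from real names to pseudonyms
name_map <- c("Alice Chen" = "Speaker_A",
              "Bob Okafor" = "Speaker_B")

transcript <- "Alice Chen asked Bob Okafor about the survey."

# Replace each name in turn; fixed() treats the name as a literal
# string rather than a regular expression
for (real_name in names(name_map)) {
  transcript <- str_replace_all(transcript,
                                fixed(real_name),
                                name_map[[real_name]])
}

cat(transcript, "\n")
# Speaker_A asked Speaker_B about the survey.
```

Keep the name-to-code mapping in a separate, access-restricted file: it is itself personal data and must not be distributed with the corpus.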

Corpus Annotation

Section Overview

What you will learn: What corpus annotation is and why it matters; the main types of annotation — POS tagging, lemmatisation, dependency parsing, named entity recognition, and semantic annotation; when to annotate and when plain text is sufficient; annotation formats; inter-rater reliability; and how to apply basic annotation in R

What is annotation and why does it matter?

Corpus annotation is the process of adding linguistic information to corpus texts — labelling words, phrases, or larger units with information that is not present in the raw text itself (Garside, Leech, and McEnery 1997; McEnery and Hardie 2012). Annotation transforms a corpus from a collection of raw strings into a structured linguistic resource that enables more powerful and precise analysis.

The key trade-off: annotation takes time and introduces potential errors (every automated tagger makes mistakes; every human annotator is inconsistent), but it unlocks analyses that are impossible or unreliable on raw text — for example, finding all instances of a word used as a noun (excluding its uses as a verb), or searching for all syntactic subjects of a particular verb.

Main annotation types

Part-of-speech (POS) tagging

POS tagging assigns a grammatical category label to each token — noun, verb, adjective, adverb, preposition, and so on. It is the most widely used form of corpus annotation (McEnery and Wilson 1996; Hunston 2002).

Most tagsets are based on either the Penn Treebank tagset (45 tags, widely used in English NLP) or the Universal Dependencies (UD) tagset (17 universal tags, designed for cross-linguistic use). CLAWS (used in the BNC) is a 61-tag set designed specifically for large corpora.

Common POS taggers for English:

  • udpipe (R package, covered in the LADAL Tagging and Parsing tutorial)
  • spacyr (R interface to spaCy — requires Python)
  • TreeTagger (command-line, widely used in European corpus linguistics)
  • Stanford POS Tagger (Java)

Accuracy for English on standard newswire text is typically 97–98% for the best taggers. Accuracy drops on informal text (social media, speech transcripts), historical text, and non-standard varieties.
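A minimal POS tagging and lemmatisation sketch with udpipe (the English model is a one-off download; see the LADAL Tagging and Parsing tutorial for a full walkthrough):

```r
# install.packages("udpipe")
library(udpipe)

# Download and load an English model (one-off download)
model_info <- udpipe_download_model(language = "english-ewt")
ud_model   <- udpipe_load_model(model_info$file_model)

# Annotate a sentence: tokenisation, POS tags, lemmas, dependencies
anno <- udpipe_annotate(ud_model, x = "The students wrote essays.") |>
  as.data.frame()

# Inspect token, universal POS tag, and lemma
anno[, c("token", "upos", "lemma")]
```

Because the output is a plain data frame, all subsequent filtering and counting can be done with standard dplyr verbs.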

Lemmatisation

Lemmatisation maps each token to its dictionary headword (lemma): running, ran, runs all map to the lemma run. Lemmatisation is essential for frequency studies — without it, different forms of the same word are counted separately. Most POS taggers include lemmatisation as part of their output.

Dependency parsing

Dependency parsing identifies the syntactic structure of each sentence — specifically, the grammatical relationships between words (subject, object, modifier, etc.). A dependency parse represents the sentence as a directed graph where each word is connected to its syntactic head by a labelled arc.

Dependency parsing enables searches like “find all direct objects of the verb ‘say’” or “find all nouns modified by ‘important’”. It is the basis of most modern syntactic corpus analysis. The Universal Dependencies project (Nivre et al. 2016) provides a cross-linguistically consistent annotation scheme.
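Once a corpus has been dependency-parsed (e.g. with udpipe), a search like “all direct objects of say” reduces to filtering the annotation data frame. A sketch, assuming `anno` is a udpipe-style data frame with `doc_id`, `sentence_id`, `token_id`, `head_token_id`, `dep_rel`, and `lemma` columns:

```r
library(dplyr)

# Join each token to its syntactic head within the same sentence,
# then keep tokens whose relation is "obj" and whose head is "say"
direct_objects_of_say <- anno |>
  inner_join(anno,
             by = c("doc_id", "sentence_id",
                    "head_token_id" = "token_id"),
             suffix = c("", "_head")) |>
  filter(dep_rel == "obj", lemma_head == "say") |>
  select(doc_id, sentence_id, token, lemma, lemma_head)
```

The self-join pattern generalises: swapping the `dep_rel` and head-lemma conditions retrieves any head–dependent pairing of interest.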

Named entity recognition (NER)

NER identifies and classifies proper names in text — people, organisations, locations, dates, monetary values. It is particularly important for corpus compilation: if you want to remove identifying information from participant data, NER can flag proper names for review. NER is also central to many applications in computational social science, where tracking mentions of specific entities across a corpus is the research goal.

Semantic annotation

Beyond lexical categories, semantic annotation labels words or phrases with information about their meaning:

  • Word sense disambiguation — which sense of a polysemous word is intended? (e.g. bank as financial institution vs. riverbank)
  • Semantic role labelling — what role does each phrase play in the event described by the verb? (agent, patient, instrument, location, etc.)
  • Sentiment annotation — positive, negative, or neutral stance?
  • Coreference — which noun phrases refer to the same entity?

Semantic annotation is labour-intensive when done manually and requires substantial expertise. It is typically applied to smaller, highly specialised corpora.

When to annotate

Not all corpora need annotation. Consider annotating when:

  • Your research question requires distinguishing different grammatical uses of the same form (e.g. noun vs. verb uses of round)
  • You want to search for abstract grammatical patterns (e.g. all passive constructions)
  • You need normalised frequency counts (per lemma rather than per wordform)
  • You are studying syntactic phenomena (clause structure, argument realisation)
  • You need to identify and remove proper names for anonymisation

Consider staying with plain text when:

  • Your research question is about surface-level patterns (specific word sequences, character n-grams)
  • The annotation error rate is likely to be high for your text type (e.g. informal social media)
  • You are using the corpus as training data for machine learning (where annotation errors can be harmful)
  • Time and resources are limited and annotation is not essential

Annotation formats

Vertical format (one token per line with annotation columns) is the standard for most corpus tools (McEnery and Wilson 1996):

Token     POS   Lemma
The       DT    the
students  NNS   student
wrote     VBD   write
essays    NNS   essay

CoNLL-U format (used by Universal Dependencies) extends vertical format with ten fields: ID, form, lemma, UPOS, XPOS, morphological features, head, dependency relation, enhanced dependencies, and miscellaneous information.

Inline XML/TEI format embeds annotation in the text itself: <w pos="NN" lemma="student">students</w>.

Most corpus tools (AntConc, Sketch Engine, CQPweb) expect vertical or inline annotation formats. The LADAL Tagging and Parsing tutorial shows how to produce POS-tagged output with udpipe in R.
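Exporting an annotation data frame in vertical format is straightforward with readr. A sketch, assuming `anno` is a udpipe-style data frame with `token`, `xpos`, and `lemma` columns (adjust the column names to your tagger's output):

```r
library(readr)
library(dplyr)

# Write token / POS / lemma columns as tab-separated vertical format
anno |>
  select(token, xpos, lemma) |>
  write_tsv("corpus_vertical.tsv", col_names = TRUE)
```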

Inter-rater reliability

Whenever annotation is performed by human annotators — or when you want to evaluate how well an automated tagger performs on your specific text type — you need to assess inter-rater reliability (IRR): the degree of agreement between two or more annotators applying the same scheme to the same data.

The most widely used measure for categorical annotation is Cohen’s kappa (κ) (Cohen 1960):

  • κ = 1.0: perfect agreement
  • κ > 0.80: strong agreement (generally considered acceptable for publication)
  • 0.60 < κ ≤ 0.80: moderate agreement (may be acceptable depending on task complexity)
  • κ < 0.60: poor agreement (annotation scheme needs revision or annotators need more training)

For sequence labelling tasks (like NER), the F1 score comparing annotator outputs is more commonly used.

Code
# install.packages("irr")
library(irr)

# Example: two annotators labelling 20 tokens as noun/verb/adj/other
annotator_1 <- c("N","N","V","N","Adj","V","N","Other","N","V",
                  "N","Adj","V","N","V","N","N","Other","V","N")
annotator_2 <- c("N","N","V","N","Adj","V","N","N","N","V",
                  "N","Adj","V","Adj","V","N","N","Other","V","N")

# Cohen's kappa
kappa_result <- irr::kappa2(cbind(annotator_1, annotator_2))
print(kappa_result)

# Percentage agreement (simpler but doesn't account for chance)
pct_agree <- mean(annotator_1 == annotator_2)
cat("Percentage agreement:", round(pct_agree * 100, 1), "%\n")
Exercises: Annotation

Q9. A researcher wants to study passive constructions in a corpus of news articles. She finds 342 instances of “was/were + past participle” using a simple regex search. Her supervisor points out that this pattern also matches predicative adjectives (e.g. “The report was detailed”) which are not passive constructions. What is the best solution to this problem?






Quality Control

Section Overview

What you will learn: Why quality control is a necessary stage of corpus compilation, not an optional extra; spot-sampling procedures for checking cleaned texts; consistency checking for metadata; basic corpus statistics as quality indicators; and how to document quality control procedures

Why quality control matters

Even with systematic procedures, errors accumulate during corpus compilation. OCR produces incorrect characters. Cleaning scripts remove too much or too little. Metadata is entered inconsistently. Encoding conversions introduce garbled text. Files are duplicated. Without a dedicated quality control stage, these errors propagate into the analysis and may not be discovered until a reviewer or reader notices something odd — at which point, correcting them requires re-running the entire analysis.

Quality control is not a single step but a mindset: every stage of corpus compilation should include a check of its outputs before proceeding to the next stage.

Spot-sampling

The most practical quality control method for large corpora is systematic spot-sampling: randomly selecting a sample of files and inspecting them manually against the expected output.

A good spot-sampling procedure:

  1. After each major processing step (cleaning, encoding conversion, tokenisation), select a random sample of 5–10% of files
  2. For each sampled file, compare the processed version against the original
  3. Look for: missing content, garbled characters, over-cleaned text, under-cleaned text, incorrect file names
  4. Document any problems found, estimate their prevalence, and decide whether to fix them before proceeding
Code
library(dplyr)
library(readr)

# Random spot-sample of 10 files for manual inspection
set.seed(42)
files_to_check <- corpus_df |>
  slice_sample(n = 10) |>
  pull(filename)

cat("Files selected for spot-check:\n")
cat(paste(files_to_check, collapse = "\n"), "\n\n")

# Print first 200 characters of each for a quick visual check
for (f in files_to_check) {
  raw_path     <- file.path("data/raw",     f)
  cleaned_path <- file.path("data/cleaned", f)

  raw_text     <- readr::read_file(raw_path)
  cleaned_text <- readr::read_file(cleaned_path)

  cat("=== FILE:", f, "===\n")
  cat("RAW (first 200 chars):\n",     substr(raw_text,     1, 200), "\n\n")
  cat("CLEANED (first 200 chars):\n", substr(cleaned_text, 1, 200), "\n\n")
  cat(rep("-", 60), "\n", sep = "")
}

Basic corpus statistics as quality indicators

Computing basic corpus statistics and checking them for anomalies is a fast and effective quality check:

Code
library(dplyr)
library(stringr)
library(ggplot2)

# Compute basic statistics for each file
corpus_stats <- corpus_df |>
  mutate(
    n_chars  = nchar(text_cleaned),
    n_words  = str_count(text_cleaned, "\\S+"),
    n_sents  = str_count(text_cleaned, "[.!?]+(\\s|$)"),  # crude sentence count
    avg_word_len = n_chars / pmax(n_words, 1)
  )

# Summary
summary(corpus_stats[, c("n_chars", "n_words", "n_sents", "avg_word_len")])

# Flag potential outliers: files with very few words (possibly over-cleaned)
# or extremely long files (possibly two documents merged)
outliers <- corpus_stats |>
  filter(n_words < 50 | n_words > quantile(n_words, 0.99))

if (nrow(outliers) > 0) {
  cat("Potential outliers detected (", nrow(outliers), "files):\n")
  print(outliers[, c("filename", "n_words", "n_chars")])
}

# Visualise word count distribution — anomalies show up clearly
ggplot(corpus_stats, aes(x = n_words)) +
  geom_histogram(bins = 50, fill = "#4E79A7", colour = "white") +
  labs(title = "Distribution of word counts across corpus files",
       x = "Words per file", y = "Number of files") +
  theme_minimal()

A healthy corpus shows a roughly unimodal distribution of file lengths. Very short files (< 50 words) may be cleaning artefacts or empty files. Very long files may represent two documents accidentally merged during collection. Bimodal distributions may indicate that the corpus contains two fundamentally different text types that should be in separate subcorpora.

Duplicate detection

Near-duplicate texts inflate corpus frequencies and bias results. After cleaning, always check for duplicates:

Code
library(dplyr)
library(stringr)

# Exact duplicates: same cleaned text
# Note: digest() is not vectorised, so hash each text individually
corpus_df <- corpus_df |>
  mutate(text_hash = vapply(text_cleaned, digest::digest,
                            character(1), algo = "md5"))

exact_dupes <- corpus_df |>
  group_by(text_hash) |>
  filter(n() > 1) |>
  arrange(text_hash)

if (nrow(exact_dupes) > 0) {
  cat("Exact duplicates found:", nrow(exact_dupes), "files\n")
  print(exact_dupes[, c("filename", "text_hash")])
}

# Near-duplicates: very high character overlap
# A simple heuristic: files where the first 500 characters are identical
corpus_df <- corpus_df |>
  mutate(start_500 = substr(text_cleaned, 1, 500))

near_dupes <- corpus_df |>
  group_by(start_500) |>
  filter(n() > 1 & nchar(start_500) > 100)

if (nrow(near_dupes) > 0) {
  cat("Possible near-duplicates found:", nrow(near_dupes), "files\n")
  print(near_dupes |> ungroup() |> select(filename))
}

Documenting quality control

Every quality control step should be documented in your research log:

  • Date and version of the corpus checked
  • Sampling method (random, systematic, or census)
  • Sample size
  • Issues found and their estimated prevalence
  • Decisions made (fix before proceeding? accept known error rate? flag in README?)

This documentation allows readers of your research to assess the reliability of your corpus independently.

Exercises: Quality Control

Q10. After cleaning a corpus of 500 web pages, a researcher computes the word count distribution and finds that 12 files have fewer than 20 words each, while the rest of the corpus averages 800 words per file. What are the two most likely explanations for these very short files, and what should the researcher do?






Planning Your Corpus Project

Section Overview

What you will learn: A six-step framework for planning a corpus-based research project from research question to analysis-ready data; corpus size guidance for different project types; the importance of pilot testing; and a realistic project timeline

Step 1: Define your research question

Everything else follows from a clear research question. Before collecting a single text, you should be able to answer:

  • What linguistic phenomenon am I investigating?
  • What population or language variety does it concern?
  • What claims do I want to test or explore?

Good corpus research questions are specific, answerable with frequency or distributional data, and feasible within your time and resource constraints:

“How do modal verbs differ in L1 versus L2 academic writing?”
“Do male and female bloggers differ in their use of intensifiers?”

Poor corpus research questions are too broad, require methods other than corpus analysis, or are unanswerable with available data:

“How do people use language?” ✗ — far too broad
“What do speakers intend when using X?” ✗ — requires interviews, not corpus analysis

Step 2: Identify your data needs

Your research question determines what data you need:

  • What type of language data? (Written or spoken; which genres; which variety?)
  • What time period is relevant?
  • What speaker or writer characteristics matter? (Age, first language, education level?)
  • How much data? (Consider how frequent the phenomenon you are studying is likely to be)

Example: Research question: “Do male and female bloggers differ in their use of intensifiers?”

Data needs: blog posts; contemporary (recent years); gender-balanced sample; metadata on author gender; sufficient corpus size to capture intensifiers — probably several hundred posts.
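The “how much data” question can be made concrete with simple arithmetic: if you know (or can pilot-estimate) the phenomenon's frequency per million words, you can back out the corpus size needed for a target number of hits. The figures below are invented for illustration:

```r
# Hypothetical figures: intensifiers occur ~1200 times per million words,
# and we want at least 2000 hits for reliable subgroup comparisons
freq_per_million <- 1200
target_hits      <- 2000

words_needed <- target_hits / freq_per_million * 1e6
posts_needed <- ceiling(words_needed / 400)  # assuming ~400 words per post

cat("Corpus size needed:", round(words_needed), "words\n")
cat("At ~400 words per post, that is about", posts_needed, "posts\n")
```

For rare phenomena, this calculation often reveals that the planned corpus is an order of magnitude too small — a finding far better made at the planning stage than after collection.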

Step 3: Determine corpus scope

  • Pilot study: 50–100 texts. Test feasibility before committing to full scale.
  • MA thesis: 200–500 texts or 100K–500K words. Varies by methodology and discipline.
  • PhD dissertation: 500+ texts or 1M+ words. Larger scope supports stronger generalisability claims.

If you will be comparing subgroups, decide whether you need equal amounts from each (balanced) or proportional amounts (representative). Consider whether you want depth (fewer texts analysed intensively) or breadth (more texts with less intensive analysis per text).

Step 4: Plan data collection

  • Where will you get your data? Identify specific sources.
  • What permissions or ethics approvals do you need, and how long will they take?
  • Create a timeline — and be realistic. Data collection almost always takes longer than initially expected.
  • Have a backup plan in case your primary source becomes unavailable.

Step 5: Prepare for data processing

  • What tools will you need? Text editors, Python, AntConc, R?
  • What skills do you need to develop, and how long will that take?
  • Where will the data be stored, and how will it be backed up?
  • How will you document your decisions throughout the process?

Step 6: Pilot test before full-scale collection

This step is the one most commonly skipped — and the one that saves the most pain. Before investing in full-scale data collection:

  1. Collect a small sample (10–20 texts)
  2. Apply your full cleaning and analysis workflow to the sample
  3. Check that the data actually answers your research questions — sometimes pilot testing reveals you need different data than you thought
  4. Revise your collection and processing plans based on what you find
Pilot testing is not optional

Discovering a fundamental problem with your data collection strategy after you have already collected 500 texts is far more costly than discovering it after 10 texts. Even an experienced corpus linguist will pilot-test before committing to full-scale collection.

A realistic project timeline

Here is how time is typically distributed across a small corpus project (8 weeks):

  • Weeks 1–2 (Planning): define research question; identify sources; obtain ethics approval
  • Weeks 3–4 (Collection): gather texts; record initial metadata
  • Weeks 5–6 (Preparation): clean texts; format files; finalise metadata spreadsheet
  • Week 7 (Analysis): concordance searches, frequency analysis, statistical tests
  • Week 8 (Write-up): interpret results; draft report or paper

Notice that preparation (cleaning and formatting) takes two full weeks — as long as collection itself. Researchers routinely underestimate this phase. Budget two to three times your initial estimate for data cleaning and preparation.

Exercises: Planning

Q6. A PhD student tells her supervisor she expects to spend one week collecting data and one week cleaning it before beginning analysis in week three of a 12-week project. Her supervisor says this timeline is unrealistic. Why?






Common Pitfalls

Section Overview

What you will learn: The seven most common mistakes in corpus compilation and how to avoid each one

  1. Collecting first, planning later. Problem: you end up with unusable or inappropriate data. Solution: define your research question and data needs before collecting a single text.
  2. Underestimating preparation time. Problem: you spend 80% of project time cleaning when you expected it to be quick. Solution: budget 2–3× your initial estimate for data preparation.
  3. Inconsistent metadata. Problem: you cannot filter or compare subgroups. Solution: create your metadata spreadsheet at the start; fill it in as you collect each text.
  4. Poor documentation. Problem: six months later, you cannot remember why you made certain decisions. Solution: keep a research log; document everything as you go.
  5. No backup plan. Problem: you lose access to a data source, equipment fails, or data gets corrupted. Solution: maintain multiple backups; diversify sources if possible.
  6. Ignoring ethics and copyright. Problem: you cannot use or publish your findings. Solution: address legal and ethical issues before collecting.
  7. Overly ambitious scope. Problem: the project becomes unmanageable and never finishes. Solution: start small; pilot-test to understand what is feasible; expand if needed.

Exercises: Pitfalls

Q7. A researcher has been collecting data for three months and has built a large corpus of newspaper articles. She now discovers that most of the articles she collected were from behind a paywall and she does not have permission to use them for research. Which pitfall does this illustrate, and what should she have done differently?






Summary

This tutorial has walked through the complete workflow for compiling a corpus, from the initial research question to a collection of clean, formatted, annotated text files with a well-organised metadata spreadsheet and a shareable folder structure.

The foundations — your corpus is your evidence. The quality of your findings directly depends on the quality of your data. Data preparation is research work, not just technical work.

The five principles — purpose-driven collection, representativeness, comparability, ethical compliance, and documentation — should guide every decision from source selection to metadata coding.

Data sources and corpus size — choose your sources based on your research question; estimate needed corpus size from phenomenon frequency rather than arbitrary word-count targets; consult the table of major public corpora before deciding to build your own; and distinguish balanced from representative designs.

Specialised corpus types — spoken corpora require transcription conventions, time-budgeting for transcription, and rich contextual metadata; web/social media corpora demand attention to API instability and the ethics of contextually private public data; learner corpora need independent proficiency assessment and task standardisation; historical corpora must address OCR quality and spelling normalisation; legal, medical, and parliamentary corpora each have specific access and copyright frameworks; multilingual and parallel corpora require alignment and careful language-variety documentation.

Ethics and legal frameworks — obtain informed consent before collecting participant data; understand the regulatory framework that applies to your jurisdiction (Privacy Act, GDPR); choose an appropriate Creative Commons licence for distribution; and address copyright before collection, not after.

Text cleaning — remove noise while preserving what you are studying; use stringr for systematic, documented cleaning pipelines; convert PDFs with pdftools and Word documents with officer; detect and fix encoding errors with readr::guess_encoding().

Corpus folder structure — every shareable corpus should have a named root folder containing a README, a LICENSE, a metadata file, and a data/ folder. Never overwrite raw data; keep a research log; back up to multiple locations; validate that filenames match metadata with anti_join() before analysis or sharing.

Annotation — choose annotation types appropriate to your research question; POS tagging and lemmatisation are sufficient for most lexical and grammatical studies; dependency parsing is needed for syntactic research; always report inter-rater reliability when human annotation is involved.

Quality control — spot-sample after every major processing step; compute basic corpus statistics and investigate outliers; check for duplicate texts; document every quality control decision in your research log.

Project planning — start with clear research questions, pilot-test before full-scale collection, and budget realistically for data preparation.

Where to go next

Citation & Session Info

Schweinberger, Martin. 2026. Compiling a Corpus: From Texts to Analysis-Ready Data. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/corpuscompilation_tutorial/corpuscompilation_tutorial.html (Version 2026.05.01).

@manual{schweinberger2026corpus,
  author       = {Schweinberger, Martin},
  title        = {Compiling a Corpus: From Texts to Analysis-Ready Data},
  note         = {tutorials/corpuscompilation_tutorial/corpuscompilation_tutorial.html},
  year         = {2026},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address      = {Brisbane},
  edition      = {2026.05.01}
}
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. The conceptual content, structure, examples, and exercises are based on lecture materials and teaching notes by Martin Schweinberger (SLAT7829 Text Analysis and Corpus Linguistics, Week 4). Claude was used to draft and structure the tutorial text, R code, and exercises. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] checkdown_0.0.13

loaded via a namespace (and not attached):
 [1] digest_0.6.39       codetools_0.2-20    fastmap_1.2.0      
 [4] xfun_0.56           glue_1.8.0          knitr_1.51         
 [7] htmltools_0.5.9     rmarkdown_2.30      cli_3.6.5          
[10] litedown_0.9        renv_1.1.7          compiler_4.4.2     
[13] rstudioapi_0.17.1   tools_4.4.2         commonmark_2.0.0   
[16] evaluate_1.0.5      yaml_2.3.10         BiocManager_1.30.27
[19] rlang_1.1.7         jsonlite_2.0.0      htmlwidgets_1.6.4  
[22] markdown_2.0       



References

Atkins, Sue, Jeremy Clear, and Nicholas Ostler. 1992. “Corpus Design Criteria.” Literary and Linguistic Computing 7 (1): 1–16. https://doi.org/10.1093/llc/7.1.1.
Biber, Douglas, Susan Conrad, and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge Approaches to Linguistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511804489.
Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20 (1): 37–46. https://doi.org/10.1177/001316446002000104.
Flowerdew, Lynne. 2012. Corpora and Language Education. Research and Practice in Applied Linguistics. Basingstoke: Palgrave Macmillan. https://doi.org/10.1057/9780230355569.
Garside, Roger, Geoffrey Leech, and Anthony McEnery, eds. 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman.
Hunston, Susan. 2002. Corpora in Applied Linguistics. Cambridge Applied Linguistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139524773.
Koehn, Philipp. 2005. “Europarl: A Parallel Corpus for Statistical Machine Translation.” In Proceedings of Machine Translation Summit X: Papers, 79–86. Phuket, Thailand. https://aclanthology.org/2005.mtsummit-papers.11/.
MacWhinney, Brian. 2000. The CHILDES Project: Tools for Analyzing Talk. 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates.
McEnery, Tony, and Andrew Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge Textbooks in Linguistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511981395.
McEnery, Tony, and Andrew Wilson. 1996. Corpus Linguistics: An Introduction. Edinburgh Textbooks in Empirical Linguistics. Edinburgh: Edinburgh University Press.
Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, et al. 2016. “Universal Dependencies v1: A Multilingual Treebank Collection.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 1659–66. Portorož, Slovenia: European Language Resources Association (ELRA). https://aclanthology.org/L16-1262/.
Sinclair, John. 1991. Corpus, Concordance, Collocation. Describing English Language. Oxford: Oxford University Press.