This tutorial introduces methods for analysing learner language — the written and spoken production of second language (L2) learners — using R. Learner language, also called interlanguage (Selinker, Swain, and Dumas 1975), is the systematic, rule-governed variety of language produced by a learner at a given point in their development. It differs from the target language in predictable ways that reflect the learner’s evolving grammatical and lexical knowledge, transfer from their first language (L1), and the specific instructional and communicative contexts they have encountered.
Corpus-based approaches to learner language — using collections of authentic learner texts known as learner corpora — have grown substantially since the 1990s (Granger 2009). They allow researchers to move beyond anecdotal observation and examine patterns of learner language systematically, at scale, and in comparison with the production of native or proficient speakers. The availability of well-annotated learner corpora and powerful R packages makes it possible to carry out many standard learner corpus analyses reproducibly, with relatively compact code.
This tutorial covers seven core analysis types, progressing from basic frequency-based methods to more linguistically sophisticated measures:
Concordancing — extracting and inspecting keyword-in-context (KWIC) lines
Frequency lists — ranking words by frequency, with and without stopwords
Sentence length — computing and comparing average sentence length across L1 groups
N-gram analysis — extracting bigrams and comparing their use between learners and L1 speakers
Collocations and collocation networks — identifying strongly co-occurring word pairs and visualising their relationships
Part-of-speech tagging and POS-sequence analysis — automatically tagging word classes and comparing grammatical patterns
Lexical diversity and readability — quantifying vocabulary richness and text complexity
Prerequisite Tutorials
Before working through this tutorial, please ensure you are familiar with:
Getting Started with R and RStudio
Loading, Saving, and Generating Data in R
String Processing in R
Regular Expressions in R
Introduction to Text Analysis — Part 1
Learning Objectives
By the end of this tutorial you will be able to:
Load and inspect learner corpus data in R
Extract and sort KWIC concordances for words and phrases
Build and visualise frequency lists with and without stopwords
Compute normalised sentence length and compare it across L1 groups
Extract bigrams, normalise their frequencies, and test for significant learner–L1 differences using Fisher’s exact test with Bonferroni correction
Identify collocations using log-likelihood and visualise them as network graphs
POS-tag texts with udpipe and compare POS-sequence bigrams across groups
Calculate and compare multiple lexical diversity measures and Flesch readability scores
Detect and quantify spelling errors using hunspell
Interpret the results of these analyses in the context of second language acquisition research
Citation
Schweinberger, Martin. 2026. Analysing Learner Language with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/llr/llr.html (Version 2026.02.24).
Background: Learner Corpora
Section Overview
What you’ll learn: What learner corpora are, how they differ from native-speaker corpora, and an introduction to the two corpora used in this tutorial
A learner corpus is a principled, machine-readable collection of texts produced by L2 learners of a target language (Granger 2009). Learner corpora can contain written essays, spoken transcripts, or both. They are typically annotated with metadata about the learner — their L1 background, proficiency level, age, educational context, and the task type that elicited the text. This metadata makes it possible to compare production across L1 groups, proficiency levels, or task types, and to contrast learner production with that of native or expert speakers.
The analysis of learner corpora is central to the field of Learner Corpus Research (LCR), which applies corpus linguistic methods to questions in second language acquisition (SLA), language pedagogy, and language testing (Gilquin and Granger 2015). Common research questions include: Do learners over- or under-use certain words, constructions, or discourse markers relative to native speakers? Does lexical diversity increase with proficiency? Are spelling or grammatical error rates associated with L1 background? Do learners from different L1 backgrounds show distinct error profiles?
The ICLE and LOCNESS Corpora
This tutorial uses data from two well-known learner corpus resources:
ICLE — International Corpus of Learner English (Granger et al. 1993). ICLE contains argumentative essays written by advanced university-level EFL learners from 16 different L1 backgrounds. The essays in this tutorial are drawn from the German, Spanish, French, Italian, Polish, and Russian sub-corpora of ICLE.
LOCNESS — Louvain Corpus of Native English Essays (Granger, Sanders, and Connor 2005). LOCNESS contains essays written by native speakers of English, including British A-level students (the sub-corpus used here) and American university students. It provides the native-speaker baseline against which the ICLE learner data is compared.
Both corpora were compiled by the Centre for English Corpus Linguistics (CECL) at the Université catholique de Louvain, Belgium. The essays in this tutorial deal with the topic of transport, which is one of the prompt topics common to both ICLE and LOCNESS, enabling direct comparison between learner and native-speaker production on the same task.
Data Access
The ICLE and LOCNESS corpora are commercially licensed and cannot be distributed freely. In this tutorial we load pre-processed sub-samples hosted on the LADAL server, which is sufficient to follow all examples. If you wish to work with the full corpora, contact the CECL at UCLouvain.
Setup
Installing Packages
Code
# Run once to install — comment out after installation
install.packages("quanteda")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")
install.packages("tidyverse")
install.packages("flextable")
install.packages("tidytext")
install.packages("udpipe")
install.packages("koRpus")
install.packages("stringi")
install.packages("hunspell")
install.packages("wordcloud2")
install.packages("tokenizers")
install.packages("checkdown")
Loading Packages
Code
# Load at the start of every session
library(tidyverse)            # dplyr, ggplot2, stringr, tidyr
library(flextable)            # formatted display tables
library(tidytext)             # stop_words and tidy text utilities
library(udpipe)               # POS tagging
library(quanteda)             # corpus and KWIC infrastructure
library(quanteda.textstats)   # textstat_collocations, textstat_readability, textstat_lexdiv
library(quanteda.textplots)   # textplot_xray, textplot_network
library(koRpus)               # lexical diversity measures
library(stringi)              # string reversal for sorted concordances
library(hunspell)             # spell checking
library(wordcloud2)           # word cloud visualisation
library(tokenizers)           # sentence splitting and word counting
library(checkdown)            # interactive quiz questions
Loading the Data
We load eight essay files: two from LOCNESS (L1 British English, split across two files) and one each from six ICLE sub-corpora representing learners whose L1 is German, Spanish, French, Italian, Polish, or Russian.
Code
# L1 English (LOCNESS — British A-level essays)
ns1 <- base::readRDS("tutorials/llr/data/LCorpus/ns1.rda", "rb")
ns2 <- base::readRDS("tutorials/llr/data/LCorpus/ns2.rda", "rb")
# L2 learners (ICLE sub-corpora)
es <- base::readRDS("tutorials/llr/data/LCorpus/es.rda", "rb")
de <- base::readRDS("tutorials/llr/data/LCorpus/de.rda", "rb")
fr <- base::readRDS("tutorials/llr/data/LCorpus/fr.rda", "rb")
it <- base::readRDS("tutorials/llr/data/LCorpus/it.rda", "rb")
pl <- base::readRDS("tutorials/llr/data/LCorpus/pl.rda", "rb")
ru <- base::readRDS("tutorials/llr/data/LCorpus/ru.rda", "rb")
Let us inspect the first few lines of the Russian learner data to see what the raw ICLE files look like:
Code
ru %>%
  stringr::str_remove("<[A-Z]{4,4}.*") %>%   # remove ICLE file header tags
  na_if("") %>%
  na.omit() %>%
  head(5)
[1] "It is now a very wide spread opinion, that in the modern world there is no place for dreaming and imagination. Those who share this point of view usually say that at present we are so very much under the domination of science, industry, technology, ever-increasing tempo of our lives and so on, that neither dreaming nor imagination can possibly survive. Their usual argument is very simple - they suggest to their opponents to look at some samples of the modern art and to compare them to the masterpieces of the \"Old Masters\" of painting, music, literature."
[2] "As everything which is simple, the argument sounds very convincing. Of course, it is evident, that no modern writer, painter or musician can be compare to such names as Bach, Pushkin< Byron, Mozart, Rembrandt, Raffael et cetera. Modern pictures, in the majority of cases, seem to be merely repetitions or combinations of the images and methods of painting, invented very long before. The same is also true to modern verses, novels and songs."
[3] "But, I think, those, who put forward this argument, play - if I may put it like this - not fair game with their opponents, because such an approach presupposes the firm conviction, that dreaming and imagination can deal only with Arts, moreover, only with this \"well-established set\" of Arts, which includes music, painting, architecture, sculpture and literature. That is, a person, who follows the above-mentioned point of view tries to make his opponent take for granted the statement, the evidence of which is, to say the least, doubtful."
[4] "But actually, these are not only music, painting, writing which are the spheres to which one's dreaming and imagination can be applied. First of all, there are quite a few other \"arts\". Probably, these are not as well-established as those mentioned above, but they are also \"arts\", and besides, they flourish nowdays. Let us take cinema, for example. Originally, in the beginning of the century, it was only an entertainment for common people. But today we may already call it \"art\". Now it has a hundred years of its history, and we all know, that cinema is really able to create masterpieces. Our contemporaries, film-directors of this \"dominated by science, technology and industrialization\" century create at present films like \"The Good, The Bad, The Ugly\", \"8 ½\", \"Once Upon a Time in America\", \"Shining\" et cetera, which, perhaps, in the next century will be regarded by our descendants just like w regard Raffael's pictures or Shakespeare's sonnets now."
[5] "By the way, I can hardly observe any connection between the \"domination of science, technology and industrialization\" over our world and the decline of arts. There may be some other reasons for it, but niether science nor technology themselves. Moreover, scientific and technological progress may very often create new means and new ways to apply one's imagination and \"materialize\"one's dream. For example, the newest computer technology, I think, has already created a whole range of ways to apply one's imagination - the computer design, animation and plenty of other things - and still has a great capasity for developement."
The text is stored as a character vector where each element is typically a paragraph or a short passage. The <ICLE-...> header tag that appears at the top of each file encodes metadata (the learner’s L1, proficiency level, and essay topic); we strip this in subsequent processing steps.
We also create two combined objects — one pooling all L1 texts and one pooling all learner texts — which we will use throughout:
Code
l1      <- c(ns1, ns2)                 # all native-speaker text
learner <- c(de, es, fr, it, pl, ru)   # all learner text
Concordancing
Section Overview
What you’ll learn: How to extract keyword-in-context (KWIC) concordances for individual words and phrases, how to sort them by preceding or following context, and how to visualise dispersion patterns across texts
Key function: quanteda::kwic()
Concordancing — the extraction of words or phrases from a corpus together with their surrounding context — is one of the most fundamental operations in corpus linguistics (Lindquist 2009). A keyword-in-context (KWIC) display places the search term in the centre of a fixed-width window of preceding and following tokens, making it easy to inspect how a word is actually used across many instances. KWIC concordances are useful for: verifying that a search pattern is returning the intended tokens; examining how learners use a specific word or construction compared to L1 speakers; extracting authentic examples for pedagogical or analytical purposes; and as a preliminary step before more quantitative analyses.
Extracting a KWIC for a Single Word
We use quanteda::kwic() to extract concordance lines for the word problem and its morphological variants (e.g. problems) in the learner corpus. The pattern argument accepts a regular expression when valuetype = "regex" is set. A window of 10 tokens gives 10 words of left and right context.
Code
kwic_prob <- quanteda::kwic(
  quanteda::tokens(learner),
  pattern = "problem.*",    # matches "problem", "problems", "problematic", etc.
  valuetype = "regex",
  window = 10
) |>
  as.data.frame() |>
  dplyr::select(-from, -to, -pattern)
head(kwic_prob)
docname pre keyword
1 text12 Many of the drug addits have legal problems
2 text12 countries , like Spain , illegal . They have social problems
3 text30 In our society there is a growing concern about the problem
4 text33 that once the availability of guns has been removed the problem
5 text33 honest way and remove any causes that could worsen a problem
6 text34 violence in our society . In order to analise the problem
post
1 because they steal money for buying the drug that is
2 too because people are afraid of them and the drug
3 of violent crime . In fact , particular attention is
4 of violence simply vanishes , but in this caotic situation
5 which is already particularly serious .
6 in its complexity and allow people to live in a
The output table has one row per match. The pre column shows the left context, keyword the matched token, and post the right context. The docname column identifies which document in the corpus the match came from.
Sorting Concordances
One of the most useful things to do with a concordance is sort it. Sorting by right context (the words immediately following the keyword) reveals collocational patterns to the right; sorting by reversed left context reveals patterns to the left.
Code
# Sort by right context (alphabetically by first word after keyword)
kwic_prob |>
  dplyr::arrange(post) |>
  head(8)
docname pre keyword
1 text12 Many of the drug addits have legal problems
2 text39 , greatest ideas were produced and solutions to many serious problems
3 text34 violence in our society . In order to analise the problem
4 text33 that once the availability of guns has been removed the problem
5 text30 In our society there is a growing concern about the problem
6 text12 countries , like Spain , illegal . They have social problems
7 text33 honest way and remove any causes that could worsen a problem
post
1 because they steal money for buying the drug that is
2 found . Most wonderful pieces of literature were created in
3 in its complexity and allow people to live in a
4 of violence simply vanishes , but in this caotic situation
5 of violent crime . In fact , particular attention is
6 too because people are afraid of them and the drug
7 which is already particularly serious .
Code
# Sort by reversed left context (reveals patterns immediately before keyword)
kwic_prob |>
  dplyr::mutate(prerev = stringi::stri_reverse(pre)) |>
  dplyr::arrange(prerev) |>
  dplyr::select(-prerev) |>
  head(8)
docname pre keyword
1 text33 honest way and remove any causes that could worsen a problem
2 text33 that once the availability of guns has been removed the problem
3 text34 violence in our society . In order to analise the problem
4 text30 In our society there is a growing concern about the problem
5 text12 Many of the drug addits have legal problems
6 text12 countries , like Spain , illegal . They have social problems
7 text39 , greatest ideas were produced and solutions to many serious problems
post
1 which is already particularly serious .
2 of violence simply vanishes , but in this caotic situation
3 in its complexity and allow people to live in a
4 of violent crime . In fact , particular attention is
5 because they steal money for buying the drug that is
6 too because people are afraid of them and the drug
7 found . Most wonderful pieces of literature were created in
Sorting by reversed left context is equivalent to sorting from the right edge of the left context inwards — it groups instances that share the same immediately preceding word (e.g. the problem, a problem, this problem), which is particularly useful for studying determiner or modifier patterns.
Visualising Dispersion
The textplot_xray() function from quanteda.textplots produces a dispersion plot (sometimes called a lexical dispersion plot) — a visual representation of where in each document a given term appears. Each vertical tick mark represents one occurrence; the horizontal axis represents the position within the document as a proportion of its total length.
Code
# Extract KWIC for two terms to compare their dispersion across learner texts
kwic_disp <- quanteda::kwic(
  quanteda::tokens(learner),
  pattern = c("people", "imagination")
)
quanteda.textplots::textplot_xray(kwic_disp) +
  theme_bw() +
  labs(title = "Dispersion of 'people' and 'imagination' in learner texts")
Concordancing Phrases
To search for a multi-word sequence, wrap the pattern in quanteda::phrase(). The example below retrieves all instances of very followed by any single word:
Code
kwic_very <- quanteda::kwic(
  quanteda::tokens(learner),
  pattern = phrase("^very [a-z]{1,}"),   # "very" + any lowercase word
  valuetype = "regex"
) |>
  as.data.frame() |>
  dplyr::select(-from, -to, -pattern)
When to Use Regex vs. Glob Patterns
quanteda::kwic() supports three valuetype options: "fixed" (exact string match), "glob" (wildcard with * and ?), and "regex" (full regular expression). For most single-word searches, "glob" with a trailing * (e.g. "problem*") is the simplest option. Use "regex" when you need character classes, alternation (|), or more complex patterns. Use phrase() together with "regex" for multi-word patterns.
Check Your Understanding: Concordancing
Q1. You want to find all instances of the phrase in my opinion in a learner corpus. Which quanteda::kwic() call is correct?
Q2. What is the purpose of reversing the left context (stringi::stri_reverse(pre)) before sorting a concordance?
Frequency Lists
Section Overview
What you’ll learn: How to build a word frequency list from corpus text, how to remove stopwords to reveal content-word distributions, and how to visualise frequency rankings as bar charts and word clouds
Frequency lists — ranked lists of all word types and their occurrence counts — are among the most basic but informative tools in corpus linguistics. They give an immediate picture of the vocabulary profile of a text or corpus: which words are most common, how quickly frequency drops off across the vocabulary, and how a corpus’s most frequent words compare to those of another corpus. Comparing the frequency profiles of learner and native-speaker texts can reveal systematic over- or under-use of specific words or word types.
Building a Frequency List
We build a frequency list for the pooled L1 (LOCNESS) data. The pipeline removes punctuation, normalises case, splits into word tokens, and counts.
Code
ftb <- c(ns1, ns2) |>
  stringr::str_replace_all("\\W", " ") |>   # replace non-word characters with spaces
  stringr::str_squish() |>                  # collapse multiple spaces
  tolower() |>                              # normalise to lowercase
  stringr::str_split(" ") |>                # split into word tokens
  unlist() |>
  as.data.frame() |>
  dplyr::rename(word = 1) |>
  dplyr::filter(word != "") |>
  dplyr::count(word, name = "freq") |>
  dplyr::arrange(desc(freq))
head(ftb, 10)
word freq
1 the 650
2 to 373
3 of 320
4 and 283
5 is 186
6 a 176
7 in 162
8 be 121
9 this 120
10 are 111
The most frequent words are, unsurprisingly, grammatical function words: the, a, of, and, to. These are important for syntactic processing but carry little lexical content and are rarely of interest when studying vocabulary use.
Removing Stopwords
We use dplyr::anti_join() with the stop_words lexicon from the tidytext package to remove high-frequency function words, leaving only content words.
Code
ftb_content <- ftb |>
  dplyr::anti_join(stop_words, by = "word")
head(ftb_content, 10)
word freq
1 transport 98
2 people 85
3 roads 80
4 cars 69
5 road 51
6 system 50
7 rail 48
8 traffic 45
9 public 41
10 trains 36
The content-word list now surfaces topically informative vocabulary. Because the essays are about transport, we expect words like transport, road, car, public, and people to feature prominently — and indeed they do. This topical coherence also serves as a useful sanity check that the data have been loaded and processed correctly.
Visualising Frequency as a Bar Chart
Code
ftb_content |>
  head(20) |>
  ggplot(aes(x = reorder(word, -freq), y = freq, label = freq)) +
  geom_col(fill = "#2166AC") +
  geom_text(vjust = 1.5, colour = "white", size = 3) +
  theme_bw() +
  theme(axis.text.x = element_text(size = 9, angle = 45, hjust = 1)) +
  labs(
    title = "Top 20 Content Words in L1 English (LOCNESS) Essays",
    subtitle = "Stopwords removed; essays on the topic of transport",
    x = "Word",
    y = "Frequency"
  )
Visualising Frequency as a Word Cloud
Word clouds provide a visually engaging alternative for communicating frequency information. Word size is proportional to frequency. They are best suited for communication and initial exploration rather than precise quantitative comparison.
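The chunk that draws the word cloud is not reproduced in this rendered version. A minimal sketch using the wordcloud2 package loaded above: wordcloud2() expects a data frame whose first two columns contain the word and its frequency, which ftb_content already provides.
Code
# Sketch: word cloud of the 100 most frequent content words
ftb_content |>
  head(100) |>
  wordcloud2::wordcloud2(size = 0.7)   # size scales the overall font size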
Word clouds encode frequency information through font size, which is difficult to compare precisely across words. They are useful for getting a quick visual impression of vocabulary but should not be used for quantitative claims. For rigorous frequency comparisons, use bar charts or tables with exact counts.
Check Your Understanding: Frequency Lists
Q3. You compare the raw frequency list and the content-word list for an English corpus and notice that the appears 4,200 times in the raw list but is absent from the content-word list. What caused the removal?
Sentence Length
Section Overview
What you’ll learn: How to split texts into individual sentences, how to count words per sentence, and how to compare sentence length distributions across L1 groups using box plots
Average sentence length (ASL) is one of the most widely used surface measures of syntactic complexity in learner language research. Longer sentences generally involve more syntactic subordination and coordination, which are markers of greater grammatical proficiency (Wolfe-Quintero et al. 1998). Comparing ASL across L1 groups and between learners and native speakers can reveal whether certain L1 backgrounds are associated with shorter, simpler sentences or longer, more complex ones.
Splitting Texts into Sentences
We write a reusable cleaning function that removes ICLE/LOCNESS file headers, strips internal quotation marks that can confuse sentence boundary detection, and then splits the text into individual sentences using tokenizers::tokenize_sentences().
Code
cleanText <- function(x) {
  x <- paste0(x)
  x <- stringr::str_remove_all(x, "<.*?>")       # remove XML/HTML-style tags
  x <- stringr::str_remove_all(x, fixed("\""))   # remove double quotation marks
  x <- x[x != ""]                                # drop empty strings
  x <- tokenizers::tokenize_sentences(x)
  x <- unlist(x)
  return(x)
}
# Apply to all texts
ns1_sen <- cleanText(ns1); ns2_sen <- cleanText(ns2)
de_sen  <- cleanText(de);  es_sen  <- cleanText(es)
fr_sen  <- cleanText(fr);  it_sen  <- cleanText(it)
pl_sen  <- cleanText(pl);  ru_sen  <- cleanText(ru)
The first sentences of the Russian learner data (ru_sen):
It is now a very wide spread opinion, that in the modern world there is no place for dreaming and imagination.
Those who share this point of view usually say that at present we are so very much under the domination of science, industry, technology, ever-increasing tempo of our lives and so on, that neither dreaming nor imagination can possibly survive.
Their usual argument is very simple - they suggest to their opponents to look at some samples of the modern art and to compare them to the masterpieces of the Old Masters of painting, music, literature.
As everything which is simple, the argument sounds very convincing.
Of course, it is evident, that no modern writer, painter or musician can be compare to such names as Bach, Pushkin< Byron, Mozart, Rembrandt, Raffael et cetera.
Computing Sentence Lengths
tokenizers::count_words() returns the number of whitespace-delimited tokens in each sentence. We collect these counts into a single data frame with an l1 column identifying the speaker group.
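The chunk that assembles these counts is not shown here. A minimal sketch that reproduces the structure of the sl_df object displayed below (the object and column names are taken from that output):
Code
# Sketch: one row per sentence, with its word count and the writer's L1
sl_df <- dplyr::bind_rows(
  data.frame(sentenceLength = tokenizers::count_words(c(ns1_sen, ns2_sen)), l1 = "en"),
  data.frame(sentenceLength = tokenizers::count_words(de_sen), l1 = "de"),
  data.frame(sentenceLength = tokenizers::count_words(es_sen), l1 = "es"),
  data.frame(sentenceLength = tokenizers::count_words(fr_sen), l1 = "fr"),
  data.frame(sentenceLength = tokenizers::count_words(it_sen), l1 = "it"),
  data.frame(sentenceLength = tokenizers::count_words(pl_sen), l1 = "pl"),
  data.frame(sentenceLength = tokenizers::count_words(ru_sen), l1 = "ru")
)
head(sl_df)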
sentenceLength l1
1 2 en
2 17 en
3 23 en
4 17 en
5 20 en
6 34 en
Visualising Sentence Length Distributions
Box plots are ideal for comparing distributions across groups: they display the median, interquartile range, and outliers simultaneously, making differences in central tendency and spread immediately visible.
Code
l1_labels <- c("en" = "English", "de" = "German", "es" = "Spanish",
               "fr" = "French", "it" = "Italian", "pl" = "Polish", "ru" = "Russian")
sl_df |>
  ggplot(aes(x = reorder(l1, -sentenceLength, mean), y = sentenceLength, fill = l1)) +
  geom_boxplot(outlier.alpha = 0.3, outlier.size = 1) +
  scale_x_discrete("L1 background", labels = l1_labels) +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  theme_bw() +
  labs(
    title = "Sentence Length Distributions by L1 Background",
    subtitle = "L1 groups ordered by mean sentence length (descending)",
    y = "Sentence length (words)"
  )
The plot reveals considerable variation both within and across groups. L1 English speakers tend to produce longer sentences on average than most learner groups, which may reflect greater syntactic command of subordination and embedding. However, the wide interquartile ranges and overlapping distributions remind us that sentence length alone is a noisy and indirect indicator of proficiency — a very short sentence may be stylistically deliberate, and a very long sentence may be run-on or poorly constructed.
Check Your Understanding: Sentence Length
Q4. A researcher finds that Polish learners produce significantly shorter sentences than L1 English speakers on average. Which of the following would be the most appropriate immediate follow-up analysis?
N-gram Analysis
Section Overview
What you’ll learn: How to extract word bigrams from corpus texts, how to normalise their frequencies for cross-corpus comparison, and how to identify which bigrams are used significantly more or less often by learners compared to L1 speakers using Fisher’s exact test with Bonferroni correction
N-grams are contiguous sequences of n words. Bigrams (2-grams) capture word pairs; trigrams (3-grams) capture three-word sequences; and so on. N-gram analysis is used in learner corpus research to identify formulaic sequences — multi-word units that learners may use differently from native speakers. Over-use or under-use of certain bigrams can indicate L1 transfer effects, limited access to target-language formulaic sequences, or avoidance of specific collocational patterns.
Extracting Bigrams
We tokenise each sentence set, apply tokens_ngrams(n = 2), and then build a unified data frame with an l1 column and a binary learner column.
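The full chunk is not reproduced here; the sketch below shows the core step for a single sentence set. The pipeline repeats this for every group and row-binds the results; the object names used here are illustrative.
Code
# Sketch: extract word bigrams from the first LOCNESS file
ns1_bigrams <- quanteda::tokens(tolower(ns1_sen), remove_punct = TRUE) |>
  quanteda::tokens_ngrams(n = 2) |>   # joins adjacent tokens with "_"
  unlist()
head(data.frame(ngram = ns1_bigrams, l1 = "ns1", learner = "no"))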
ngram l1 learner
ns11 transport_01 ns1 no
ns12 the_basic ns1 no
ns13 basic_dilema ns1 no
ns14 dilema_facing ns1 no
ns15 facing_the ns1 no
ns16 the_uk's ns1 no
Normalising Frequencies
Because the L1 and learner sub-corpora differ in total word count, we cannot compare raw bigram frequencies directly. We normalise to per-1,000-word rates by dividing each group’s bigram count by the total number of bigrams produced by that group, then multiplying by 1,000.
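A sketch of this normalisation step, assuming a pooled bigram data frame (here called ngram_all) with one row per bigram token and columns ngram and learner; the object and column names are assumptions, not the tutorial's own:
Code
# Sketch: per-1,000 bigram rates within each group
bigram_rates <- ngram_all |>
  dplyr::count(learner, ngram, name = "n") |>
  dplyr::group_by(learner) |>
  dplyr::mutate(per_1000 = n / sum(n) * 1000) |>
  dplyr::ungroup()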
Visual comparison is informative but we need a statistical test to determine which differences are unlikely to be due to chance. We use Fisher’s exact test, which is appropriate for count data in 2×2 contingency tables. For each bigram, the table compares: (a) its count in L1 data vs. (b) all other bigrams in L1 data, against (c) its count in learner data vs. (d) all other bigrams in learner data.
Because we run one test per bigram (potentially thousands of tests), we apply a Bonferroni correction: the critical p-value is divided by the number of tests, making the threshold much more stringent and controlling the family-wise error rate.
Code
# Reshape for Fisher's test: one row per bigram, counts for each group
fisher_df <- ngram_freq |>
  tidyr::pivot_wider(names_from = learner, values_from = freq, values_fill = 0) |>
  dplyr::rename(l1speaker = no, learner = yes) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    total_l1      = sum(l1speaker),
    total_learner = sum(learner),
    a = l1speaker,
    b = learner,
    c = total_l1 - l1speaker,
    d = total_learner - learner
  )
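The chunk that actually runs the tests is not shown in this rendered version. A sketch of how fisher_results and its sig_corr column could be derived from fisher_df (the object and column names follow the output below; the implementation itself is an assumption):
Code
# Sketch: Fisher's exact test per bigram, then a Bonferroni-corrected significance flag
fisher_results <- fisher_df |>
  dplyr::rowwise() |>
  dplyr::mutate(
    p         = fisher.test(matrix(c(a, b, c, d), nrow = 2))$p.value,
    oddsratio = fisher.test(matrix(c(a, b, c, d), nrow = 2))$estimate
  ) |>
  dplyr::ungroup() |>
  dplyr::mutate(sig_corr = ifelse(p < 0.05 / dplyr::n(), "sig.", "n.s."))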
Code
# How many bigrams reach significance after Bonferroni correction?
table(fisher_results$sig_corr)
n.s.
9036
Interpreting Fisher’s Exact Test Results
A significant result (p < Bonferroni-corrected threshold) means that the observed difference in bigram frequency between learners and L1 speakers is unlikely to have arisen by chance, given the corpus sizes. The odds ratio indicates the direction: an odds ratio > 1 means the bigram is proportionally more common in the L1 data; an odds ratio < 1 means it is proportionally more common in the learner data.
In small sub-corpora like the present data, it is common to find few or no significant results after Bonferroni correction, because the test is very conservative when the number of comparisons is large. Larger corpora or a less conservative correction (e.g. the Benjamini-Hochberg false discovery rate procedure) would typically reveal more significant differences.
Check Your Understanding: N-gram Analysis
Q5. Why is Bonferroni correction necessary when testing bigram frequency differences across an entire frequency list?
Collocations and Collocation Networks
Section Overview
What you’ll learn: How to identify statistically significant word collocations using log-likelihood, how to build a feature co-occurrence matrix, and how to visualise collocational relationships as a network graph
A collocation is a pair (or larger group) of words that co-occur more frequently than chance would predict. Identifying collocations is important in learner language research because learners often struggle with the conventional collocational patterns of the target language — they may produce grammatically correct but collocationally unusual combinations (e.g. make homework instead of do homework) that mark their output as non-native (Nesselhauf 2005).
Identifying Collocations
We use quanteda.textstats::textstat_collocations(), which computes the lambda statistic (an association measure estimated from a log-linear model of token co-occurrence) for all word pairs appearing at least min_count times. Higher lambda values indicate stronger association — the pair co-occurs much more frequently than expected by chance.
Code
# Combine L1 sentences and tokenise
ns_sen <- c(ns1_sen, ns2_sen)
ns_tokens <- quanteda::tokens(tolower(ns_sen), remove_punct = TRUE)
# Identify collocations occurring at least 20 times
ns_coll <- quanteda.textstats::textstat_collocations(ns_tokens, size = 2, min_count = 20)
collocation        count  count_nested  length     lambda          z
public transport      35             0       2  7.1702271  14.899245
it is                 23             0       2  3.1190429  12.152647
of the                72             0       2  1.6178618  11.456242
to use                21             0       2  3.4556530  10.583556
number of             32             0       2  5.6938299  10.063860
on the                31             0       2  2.1031266   9.433550
in the                39             0       2  1.6728346   8.874487
with the              24             0       2  2.3004624   8.817610
the number            20             0       2  2.9141572   8.593513
to the                40             0       2  0.6685972   3.880229
The strongest collocation in this transport-themed corpus is the topic-specific phrase public transport, followed by high-frequency grammatical sequences such as it is and of the. This is typical for domain-specific corpora: the most strongly associated content collocations tend to be topically coherent multi-word units.
Building a Collocation Network
A collocation network visualises the collocational relationships around a target term as a graph, where nodes are words and edges represent co-occurrence strength. This makes it easy to see at a glance which words cluster together around a focal term.
External Script Dependency
The code below uses the calculateCoocStatistics() function, which is not part of any CRAN package. It is available as a standalone R script (rscripts/calculateCoocStatistics.R) in the LADAL repository. Download it from the LADAL GitHub repository and place it in a sub-folder called rscripts/ within your R project before running this section. The function calculates log-likelihood-based co-occurrence statistics between a target term and all other terms in a Document-Feature Matrix.
Code
# Build a DFM from the L1 sentences (stopwords removed)
ns_dfm <- quanteda::dfm(
  quanteda::tokens(ns_sen, remove_punct = TRUE)
) |>
  quanteda::dfm_remove(pattern = stopwords("english"))
Code
# Load the co-occurrence statistics function
source("rscripts/calculateCoocStatistics.R")
# Compute log-likelihood co-occurrence statistics for the target term "transport"
coocTerm <- "transport"
coocs <- calculateCoocStatistics(coocTerm, ns_dfm, measure = "LOGLIK")
# Inspect the top 10 collocates
coocs[1:10]
public use traffic rail facing commuters cheaper
113.171974 19.437311 10.508626 9.652830 9.382889 9.382889 9.382889
roads less buses
9.080648 8.067363 6.702863
Code
# Reduce DFM to the top 10 collocates plus the target word
redux_dfm <- quanteda::dfm_select(
  ns_dfm,
  pattern = c(names(coocs)[1:10], coocTerm)
)
# Convert to Feature Co-occurrence Matrix (FCM)
# FCM[i,j] = number of documents in which word i and word j both appear
tag_fcm <- quanteda::fcm(redux_dfm)
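The plotting call itself is not shown above. A sketch of how the network described below could be drawn with quanteda.textplots; the label-size scaling is an assumption that mirrors the description in the next paragraph:
Code
# Sketch: collocation network around "transport"
quanteda.textplots::textplot_network(
  tag_fcm,
  min_freq = 2,
  edge_alpha = 0.4,
  vertex_labelsize = log(Matrix::rowSums(tag_fcm) + 1)   # log-scaled frequency
)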
In the network, the size of each node’s label is proportional to its overall frequency in the DFM (a log-scaled version of the row sums). Edges between nodes represent co-occurrence in the same document. The target term transport should appear as the most central node, with the strongest edges connecting to its most frequent and most strongly associated collocates.
Check Your Understanding: Collocations
Q6. A learner writes “she did a big mistake” instead of the native-speaker form “she made a big mistake”. What type of error does this represent, and which corpus analysis method would be most appropriate to study it systematically?
Part-of-Speech Tagging and POS-Sequence Analysis
Section Overview
What you’ll learn: How to automatically assign part-of-speech tags to corpus texts using udpipe, how to extract POS-tag bigrams, how to compare their frequencies between learners and L1 speakers, and how to use KWIC to inspect the actual words behind significant differences
Key package: udpipe
Part-of-speech (POS) tagging is the automatic assignment of grammatical category labels (noun, verb, adjective, etc.) to each token in a text. Comparing POS-tag sequences between learner and native-speaker texts reveals grammatical differences that are invisible at the word level — for example, differences in the rate of adjective use, the frequency of passive constructions, or the distribution of subordinating conjunctions. POS-based analysis is particularly powerful for studying grammatical complexity and for identifying constructions that transfer from the learner’s L1.
Required: udpipe Language Model
This section requires a pre-trained udpipe language model for English. If you have not already downloaded it, run the following code once:
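The download chunk referenced here is not shown in this rendered version; the call would typically look like the sketch below. The target folder name mirrors the load path used further down; omit model_dir to download into the working directory instead.
Code
# Run once: download the English EWT model into a local folder
dir.create("udpipemodels", showWarnings = FALSE)
udpipe::udpipe_download_model(
  language  = "english-ewt",
  model_dir = "udpipemodels"
)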
This downloads the English EWT (English Web Treebank) model to your working directory. Subsequent sessions should load it directly using udpipe_load_model() with the path to the downloaded .udpipe file, as shown below. The file is approximately 16 MB. Do not re-download it in every session — the download creates a local copy that persists between sessions.
Testing POS Tagging on a Sample Sentence
Before tagging the full corpus, we test the tagger on a single sentence to inspect the output format and verify that tagging is working as expected.
Code
# Load the pre-downloaded English EWT model
m_eng <- udpipe::udpipe_load_model(file = here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe"))
# Tag a test sentence
test_sentence <- "It is now a very wide-spread opinion that in the modern world there is no place for dreaming and imagination."
tagged_test <- udpipe::udpipe_annotate(m_eng, x = test_sentence) |>
  as.data.frame() |>
  dplyr::select(token, upos, xpos, dep_rel)
head(tagged_test, 12)
token upos xpos dep_rel
1 It PRON PRP nsubj
2 is AUX VBZ cop
3 now ADV RB advmod
4 a DET DT det
5 very ADV RB advmod
6 wide ADJ JJ amod
7 - PUNCT HYPH punct
8 spread NOUN NN compound
9 opinion NOUN NN root
10 that PRON WDT obj
11 in ADP IN case
12 the DET DT det
The output contains several annotation columns. upos is the Universal POS tag (a coarse, cross-linguistically consistent tag set: NOUN, VERB, ADJ, etc.). xpos is the Penn Treebank tag (a finer-grained English-specific tag set: NN, VBZ, JJ, etc.). dep_rel is the dependency relation label. For our bigram analysis we use xpos tags, in which adjectives are tagged JJ, present-tense verbs VBZ, personal pronouns PRP, and so on.
We create a tagged version of the text by concatenating each token with its xpos tag, separated by /:
We write a function comText() that cleans a text, runs udpipe annotation, and returns the token/tag string. We then apply it to all eight text sets.
Code
comText <- function(x) {
  x <- paste0(x, collapse = " ")
  x <- stringr::str_remove_all(x, "<.*?>")       # remove markup tags
  x <- stringr::str_remove_all(x, fixed("\""))   # remove quotation marks
  x <- stringr::str_squish(x)
  x <- x[x != ""]
  annotated <- udpipe::udpipe_annotate(m_eng, x = x) |>
    as.data.frame()
  paste0(annotated$token, "/", annotated$xpos, collapse = " ")
}
# Apply to all texts (this step takes a few minutes per text)
ns1_pos <- comText(ns1_sen); ns2_pos <- comText(ns2_sen)
de_pos  <- comText(de_sen);  es_pos  <- comText(es_sen)
fr_pos  <- comText(fr_sen);  it_pos  <- comText(it_sen)
pl_pos  <- comText(pl_sen);  ru_pos  <- comText(ru_sen)
# Preview the first 300 characters of the L1 tagged text
substr(ns1_pos, 1, 300)
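The step that turns these tagged strings into POS-tag bigrams is not reproduced in this rendered version. A sketch of one way to do it (function and object names are illustrative, not the tutorial's own); the table below shows the resulting counts of the most frequent sequences in the L1 data:
Code
# Sketch: keep only the tag from each word/TAG pair, then pair adjacent tags
posBigrams <- function(tagged, l1, learner) {
  tags <- stringr::str_split(tagged, " ")[[1]] |>
    stringr::str_remove(".*/")               # "it/PRP" -> "PRP"
  data.frame(
    ngram   = paste(tags[-length(tags)], tags[-1], sep = "_"),
    l1      = l1,
    learner = learner
  )
}
pos_ngrams <- dplyr::bind_rows(
  posBigrams(ns1_pos, "en", "no"),
  posBigrams(ru_pos,  "ru", "yes")   # repeat for the remaining groups
)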
# A tibble: 8 × 3
ngram learner freq
<chr> <chr> <int>
1 DT_NN no 520
2 IN_DT no 465
3 NN_IN no 465
4 JJ_NN no 334
5 IN_NN no 241
6 DT_JJ no 236
7 TO_VB no 235
8 NNS_IN no 222
Testing for Significant POS-Sequence Differences
We apply the same Fisher’s exact test + Bonferroni correction approach as in the N-gram section, but now operating on POS-tag bigrams rather than word bigrams.
When a POS-tag bigram differs significantly between groups, the next step is to inspect the actual word tokens that realise that sequence using KWIC. For example, the sequence PRP_VBZ (personal pronoun + third-person singular present verb, as in she runs, it matters) can be searched directly in the tagged text.
[1] pre keyword post
<0 rows> (or 0-length row.names)
Comparing the two concordances reveals whether the same syntactic pattern is associated with different lexical choices in learner and L1 production — for instance, whether learners favour different pronoun–verb combinations or use the pattern in different pragmatic contexts.
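The concordancing step itself is not reproduced above. If you want to inspect the word pairs behind a given tag sequence directly, a simple regular-expression extraction over the tagged string also works (a sketch, assuming the token/TAG format created by comText()):
Code
# Sketch: extract word pairs tagged PRP + VBZ from the tagged L1 text
stringr::str_extract_all(ns1_pos, "\\S+/PRP \\S+/VBZ")[[1]] |>
  head(10)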
Check Your Understanding: POS Tagging
Q7. Why do we strip the word tokens (e.g. convert it/PRP to just PRP) before extracting POS-tag bigrams in the sequence analysis?
Lexical Diversity and Readability
Section Overview
What you’ll learn: How to compute multiple lexical diversity measures (TTR, CTTR, Herdan’s C, Guiraud’s R, Maas) and Flesch readability scores for each essay, and how to compare these measures across L1 groups
Key packages: koRpus, quanteda.textstats
Lexical Diversity
Lexical diversity refers to the range and variety of vocabulary used in a text. It is considered an important indicator of L2 proficiency: more proficient learners tend to deploy a wider range of word types and rely less heavily on a small set of high-frequency forms (Daller et al. 2003). Numerous measures of lexical diversity have been proposed, each capturing a slightly different aspect of vocabulary range:
Lexical diversity measures

Measure       Formula                        Notes
TTR           V / N                          Types / Tokens; decreases with text length
CTTR          V / √(2N)                      Carroll’s Corrected TTR; partially controls for length
C (Herdan)    log(V) / log(N)                LogTTR; more robust to length than TTR
R (Guiraud)   V / √N                         Root TTR
U (Dugast)    log²(N) / (log(N) − log(V))    Uber index
Maas          (log(N) − log(V)) / log²(N)    Inverse measure; lower = more diverse
Splitting Texts into Individual Essays
Before computing lexical diversity, we split each text file into individual essays. In the ICLE/LOCNESS data, essays are separated by headers of the form Transport 01, Transport 02, etc.
Code
cleanEss <- function(x) {
  x |>
    paste0(collapse = " ") |>
    stringr::str_split("Transport [0-9]{1,2}") |>
    unlist() |>
    stringr::str_squish() |>
    (\(v) v[v != ""])()
}
ns1_ess <- cleanEss(ns1); ns2_ess <- cleanEss(ns2)
de_ess  <- cleanEss(de);  es_ess  <- cleanEss(es)
fr_ess  <- cleanEss(fr);  it_ess  <- cleanEss(it)
pl_ess  <- cleanEss(pl);  ru_ess  <- cleanEss(ru)
# Preview the first essay from the L1 data
substr(ns1_ess[1], 1, 400)
[1] "The basic dilema facing the UK's rail and road transport system is the general rise in population. This leads to an increase in the number of commuters and transport users every year, consequently putting pressure on the UKs transports network. The biggest worry to the system is the rapid rise of car users outside the major cities. Most large cities have managed to incourage commuters to use publi"
Computing Lexical Diversity Measures
We use koRpus::lex.div(), which computes multiple lexical diversity measures simultaneously. The segment and window arguments control the size of the moving window used for measures that are computed incrementally.
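The original chunk is not shown in this rendered version. As an illustration of the same family of measures, the sketch below uses quanteda.textstats::textstat_lexdiv() (also loaded above) rather than koRpus::lex.div(); the koRpus call additionally takes the segment and window arguments described above.
Code
# Sketch: lexical diversity measures for each Russian learner essay
quanteda::tokens(ru_ess, remove_punct = TRUE) |>
  quanteda::dfm() |>
  quanteda.textstats::textstat_lexdiv(measure = c("TTR", "C", "R", "CTTR", "U", "Maas")) |>
  head()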
Because the combined results table already carries an L1 label for each essay, comparing the measures across groups only requires selecting the relevant columns and summarising by group.
Readability
Readability measures quantify how easy or difficult a text is to read and comprehend. They are used in learner corpus research as indirect measures of writing quality and complexity. A text written by a highly proficient learner should be neither excessively simple nor unnecessarily complex. Here we focus on the Flesch Reading Ease score (Flesch 1948):

Flesch Reading Ease = 206.835 − 1.015 × ASL − 84.6 × ASW

where ASL is the average sentence length (words per sentence) and ASW is the average number of syllables per word. Higher scores indicate easier text (shorter sentences, simpler words); scores below 30 indicate very difficult text, while scores above 70 indicate easy text. This creates a somewhat counter-intuitive situation for complexity research: higher proficiency does not necessarily mean higher Flesch scores — proficient writers often produce more complex (lower-scoring) text.
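The computation itself is not shown above; a sketch using quanteda.textstats (the object name is illustrative):
Code
# Sketch: Flesch Reading Ease for each Russian learner essay, plus the group mean
flesch_ru <- quanteda.textstats::textstat_readability(ru_ess, measure = "Flesch")
head(flesch_ru)
mean(flesch_ru$Flesch)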
Check Your Understanding: Lexical Diversity and Readability
Q8. A researcher finds that Russian learners have the highest CTTR scores of all learner groups. Which interpretation is most warranted?
Spelling Errors
Section Overview
What you’ll learn: How to use the hunspell package to detect words not found in a standard English dictionary, how to compute a normalised spelling error rate, and how to compare it across L1 groups
Key package: hunspell
Spelling accuracy is one of the most directly observable dimensions of L2 written proficiency. Although advanced learners generally make fewer spelling errors than beginners, systematic differences in error rates across L1 groups can reveal specific orthographic challenges associated with transfer from particular L1 writing systems. For example, learners whose L1 uses a transparent orthography (consistent grapheme-phoneme mappings) may transfer phonological spelling strategies to English that produce systematic errors (e.g. spelling based on pronunciation rather than convention).
The hunspell package checks words against a dictionary and returns a list of words not found in it. We use the British English dictionary (en_GB) to match the LOCNESS corpus’s expected spelling norms.
Limitations of Dictionary-Based Spell Checking
Dictionary-based spell checking has several known limitations as a method for studying learner errors. It flags any word not in the dictionary — including proper nouns, technical vocabulary, abbreviations, and deliberate neologisms — as potential errors. It also misses real-word errors (e.g. their for there, form for from), which are spelled correctly but used in the wrong context. The counts here should therefore be interpreted as non-dictionary word rates rather than true spelling error rates.
Code
# Inspect non-dictionary words in the first L1 essay
hunspell::hunspell(ns1_ess[1], dict = hunspell::dictionary("en_GB")) |>
  unlist() |>
  head(20)
Code
# Function: count non-dictionary words and total words per text set
spellStats <- function(texts, dict_code = "en_GB") {
  n_errors <- hunspell::hunspell(texts, dict = hunspell::dictionary(dict_code)) |>
    unlist() |>
    length()
  n_words <- sum(tokenizers::count_words(texts))
  list(errors = n_errors, words = n_words)
}
# Apply to all text sets
spell_data <- list(
  en_ns1 = spellStats(ns1_ess), en_ns2 = spellStats(ns2_ess),
  de = spellStats(de_ess), es = spellStats(es_ess),
  fr = spellStats(fr_ess), it = spellStats(it_ess),
  pl = spellStats(pl_ess), ru = spellStats(ru_ess)
)
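The aggregation into the per-1,000-word table shown below is not reproduced here. A sketch; the object name err_tb and the rounding follow the plot code and output below:
Code
# Sketch: pool the two LOCNESS files and compute non-dictionary words per 1,000 words
err_tb <- purrr::imap_dfr(spell_data, function(x, nm) {
  data.frame(l1 = stringr::str_remove(nm, "_ns[12]$"),   # en_ns1 / en_ns2 -> en
             errors = x$errors, words = x$words)
}) |>
  dplyr::group_by(l1) |>
  dplyr::summarise(freq = round(sum(errors) / sum(words) * 1000, 1))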
# A tibble: 7 × 2
l1 freq
<chr> <dbl>
1 de 23.4
2 en 17.0
3 es 27.6
4 fr 27.5
5 it 9
6 pl 7.9
7 ru 16.2
Code
err_tb |>
  ggplot(aes(x = reorder(l1, -freq), y = freq, label = freq, fill = l1)) +
  geom_col(width = 0.7) +
  geom_text(vjust = 1.5, colour = "white", size = 4) +
  scale_x_discrete(labels = l1_labels) +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  coord_cartesian(ylim = c(0, 45)) +
  theme_bw() +
  labs(
    title = "Non-Dictionary Word Rate by L1 Background",
    subtitle = "Per 1,000 words; British English dictionary (en_GB); includes proper nouns and technical vocabulary",
    x = "L1 background",
    y = "Non-dictionary words per 1,000 words"
  )
Check Your Understanding: Spelling Errors
Q9. A researcher uses hunspell to count spelling errors in learner essays and finds that L1 English speakers have a higher non-dictionary word rate than some learner groups. How should this be interpreted?
Summary and Best Practices
This tutorial has introduced seven corpus-based methods for analysing learner language in R. Working through these methods on the ICLE/LOCNESS data has illustrated both their practical application and their interpretive limits. A few overarching principles are worth emphasising:
Normalise before comparing. Raw frequency counts are almost always misleading when comparing across texts or corpora of different sizes. Always divide by total word count (or total token count) and express frequencies per 1,000 or per million words before drawing comparisons.
Use statistics to confirm visual impressions. Bar charts and box plots reveal patterns, but they cannot distinguish systematic differences from sampling variation. Fisher’s exact test (with Bonferroni correction for multiple comparisons) provides the inferential layer that transforms a visual observation into a defensible claim.
Triangulate across methods. No single measure fully captures the complexity of learner language. Sentence length, lexical diversity, POS-sequence frequencies, collocational patterns, readability, and spelling accuracy each illuminate a different dimension. Convergent evidence across multiple measures is more compelling than a single significant result.
Interpret measures in their context. Every measure discussed here has known limitations — TTR’s sensitivity to text length, Flesch’s insensitivity to syntactic complexity, hunspell’s conflation of errors with uncommon vocabulary. Always report these caveats alongside your findings.
Document your pre-processing. The results of learner corpus analyses are sensitive to cleaning decisions: which tags are stripped, how sentence boundaries are detected, which stopword list is used. Document every step in your code so that others (and your future self) can reproduce and evaluate your choices.
Citation and Session Info
Schweinberger, Martin. 2026. Analysing Learner Language with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/llr/llr.html (Version 2026.02.24).
@manual{schweinberger2026llr,
author = {Schweinberger, Martin},
title = {Analysing Learner Language with R},
note = {https://ladal.edu.au/tutorials/llr/llr.html},
year = {2026},
organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
address = {Brisbane},
edition = {2026.02.24}
}
This tutorial was revised and substantially expanded with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to restructure the document into Quarto format, expand the analytical prose and interpretive sections, write the checkdown quiz questions and feedback strings, improve code consistency and style, add section overview callouts and learning objectives, and revise the background and summary sections. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material.
Flesch, Rudolph. 1948. “A New Readability Yardstick.” Journal of Applied Psychology 32 (3): 221–33. https://doi.org/10.1037/h0057532.
Gilquin, Gaëtanelle, and Sylviane Granger. 2015. “From Design to Collection of Learner Corpora.” The Cambridge Handbook of Learner Corpus Research 3 (1): 9–34.
Granger, Sylviane. 2009. “The Contribution of Learner Corpora to Second Language Acquisition and Foreign Language Teaching.” Corpora and Language Teaching 33: 13–32.
Granger, Sylviane, Estelle Dagneaux, Fanny Meunier, Magali Paquot, et al. 1993. “The International Corpus of Learner English.” English Language Corpora: Design, Analysis and Exploitation, 57–71.
Granger, Sylviane, Carol Sanders, and Ulla Connor. 2005. “LOCNESS: Louvain Corpus of Native English Essays.” Louvain-la-Neuve, Belgium: Centre for English Corpus Linguistics (CECL), Université catholique de Louvain.
Lindquist, Hans. 2009. Corpus Linguistics and the Description of English. Vol. 104. Edinburgh: Edinburgh University Press.
Selinker, Larry, Merrill Swain, and Guy Dumas. 1975. “The Interlanguage Hypothesis Extended to Children.” Language Learning 25 (1): 139–52.
Without it, quanteda treats the entire string as a single token pattern, which won't match anything after tokenisation has split the text into individual words.")```---**Q2. What is the purpose of reversing the left context (`stringi::stri_reverse(pre)`) before sorting a concordance?**```{r}#| echo: false#| label: "LLR_Q2"check_question("It allows sorting by the word immediately to the left of the keyword, grouping instances that share the same preceding collocate",options =c("It allows sorting by the word immediately to the left of the keyword, grouping instances that share the same preceding collocate","It reverses the order of concordance lines so the rarest instances appear first","It converts the left context to uppercase for case-insensitive sorting","It removes stopwords from the left context before sorting" ),type ="radio",q_id ="LLR_Q2",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! When you reverse each left-context string before sorting, the alphabetical sort operates from the right edge of the context inwards — meaning the word immediately preceding the keyword determines the sort order first, then the word before that, and so on. This groups concordance lines by their nearest left collocate (e.g. all instances of 'the problem' together, then 'a problem', etc.), which is very useful for studying what words immediately precede a target term.",wrong ="Think about what reversing a string does: 'the big problem' becomes 'melborp gib eht'. When you sort reversed strings alphabetically, what does the sort key correspond to in the original string?")```---# Frequency Lists {#frequency}::: {.callout-note}## Section Overview**What you'll learn:** How to build a word frequency list from corpus text, how to remove stopwords to reveal content-word distributions, and how to visualise frequency rankings as bar charts and word clouds:::**Frequency lists** — ranked lists of all word types and their occurrence counts — are among the most basic but informative tools in corpus linguistics. They give an immediate picture of the vocabulary profile of a text or corpus: which words are most common, how quickly frequency drops off across the vocabulary, and how a corpus's most frequent words compare to those of another corpus. Comparing the frequency profiles of learner and native-speaker texts can reveal systematic over- or under-use of specific words or word types.## Building a Frequency List {-}We build a frequency list for the pooled L1 (LOCNESS) data. The pipeline removes punctuation, normalises case, splits into word tokens, and counts.```{r fr1, message=FALSE, warning=FALSE}ftb <- c(ns1, ns2) |> stringr::str_replace_all("\\W", " ") |> # replace non-word characters with spaces stringr::str_squish() |> # collapse multiple spaces tolower() |> # normalise to lowercase stringr::str_split(" ") |> # split into word tokens unlist() |> as.data.frame() |> dplyr::rename(word = 1) |> dplyr::filter(word != "") |> dplyr::count(word, name = "freq") |> dplyr::arrange(desc(freq))head(ftb, 10)```The most frequent words are, unsurprisingly, grammatical function words: *the*, *a*, *of*, *and*, *to*. 
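Just how dominant these high-frequency items are becomes clear when we compute the share of all running words that the ten most frequent types account for (a quick sketch based on the `ftb` table created above, which is already sorted by descending frequency):

```{r fr_topshare, message=FALSE, warning=FALSE}
# Proportion of all word tokens covered by the ten most frequent word types
sum(head(ftb$freq, 10)) / sum(ftb$freq)
```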
These are important for syntactic processing but carry little lexical content and are rarely of interest when studying vocabulary use.## Removing Stopwords {-}We use `dplyr::anti_join()` with the `stop_words` lexicon from the `tidytext` package to remove high-frequency function words, leaving only content words.```{r fr4, message=FALSE, warning=FALSE}ftb_content <- ftb |> dplyr::anti_join(stop_words, by = "word")head(ftb_content, 10)```The content-word list now surfaces topically informative vocabulary. Because the essays are about transport, we expect words like *transport*, *road*, *car*, *public*, and *people* to feature prominently — and indeed they do. This topical coherence also serves as a useful sanity check that the data have been loaded and processed correctly.## Visualising Frequency as a Bar Chart {-}```{r fr5, message=FALSE, warning=FALSE, fig.width=9, fig.height=5}ftb_content |> head(20) |> ggplot(aes(x = reorder(word, -freq), y = freq, label = freq)) + geom_col(fill = "#2166AC") + geom_text(vjust = 1.5, colour = "white", size = 3) + theme_bw() + theme(axis.text.x = element_text(size = 9, angle = 45, hjust = 1)) + labs( title = "Top 20 Content Words in L1 English (LOCNESS) Essays", subtitle = "Stopwords removed; essays on the topic of transport", x = "Word", y = "Frequency" )```## Visualising Frequency as a Word Cloud {-}Word clouds provide a visually engaging alternative for communicating frequency information. Word size is proportional to frequency. They are best suited for communication and initial exploration rather than precise quantitative comparison.```{r fr6, message=FALSE, warning=FALSE}wordcloud2::wordcloud2( ftb_content[1:100, ], shape = "diamond", color = scales::viridis_pal()(8))```::: {.callout-warning}## Word Clouds Are ApproximateWord clouds encode frequency information through font size, which is difficult to compare precisely across words. They are useful for getting a quick visual impression of vocabulary but should not be used for quantitative claims. For rigorous frequency comparisons, use bar charts or tables with exact counts.:::---::: {.callout-tip}## Check Your Understanding: Frequency Lists:::**Q3. You compare the raw frequency list and the content-word list for an English corpus and notice that *the* appears 4,200 times in the raw list but is absent from the content-word list. What caused the removal?**```{r}#| echo: false#| label: "LLR_Q3"check_question("anti_join(stop_words) removed it because 'the' appears in the tidytext stop_words lexicon",options =c("anti_join(stop_words) removed it because 'the' appears in the tidytext stop_words lexicon","str_replace_all('\\\\W', ' ') replaced 'the' with a space because it contains no vowels","tolower() converted 'the' to an empty string","filter(word != '') removed it because single-character words are excluded" ),type ="radio",q_id ="LLR_Q3",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! anti_join(stop_words) performs an anti-join between the frequency table and the stop_words data frame from tidytext: it keeps only rows in the frequency table whose 'word' value does NOT appear in stop_words. The word 'the' is one of the most frequent entries in the stop_words lexicon (it appears under the 'snowball', 'smart', and 'onix' lexicons), so it is removed. 
The other operations — str_replace_all, tolower, and filter — do not affect 'the'.",wrong ="Review the pipeline: which specific step uses the stop_words lexicon, and what does an anti_join do with it?")```---# Sentence Length {#senlen}::: {.callout-note}## Section Overview**What you'll learn:** How to split texts into individual sentences, how to count words per sentence, and how to compare sentence length distributions across L1 groups using box plots**Key functions:** `tokenizers::tokenize_sentences()`, `tokenizers::count_words()`:::**Average sentence length** (ASL) is one of the most widely used surface measures of syntactic complexity in learner language research. Longer sentences generally involve more syntactic subordination and coordination, which are markers of greater grammatical proficiency [@wolfe1998linguistic]. Comparing ASL across L1 groups and between learners and native speakers can reveal whether certain L1 backgrounds are associated with shorter, simpler sentences or longer, more complex ones.## Splitting Texts into Sentences {-}We write a reusable cleaning function that removes ICLE/LOCNESS file headers, strips internal quotation marks that can confuse sentence boundary detection, and then splits the text into individual sentences using `tokenizers::tokenize_sentences()`.```{r split1, message=FALSE, warning=FALSE}cleanText <- function(x) { x <- paste0(x) x <- stringr::str_remove_all(x, "<.*?>") # remove XML/HTML-style tags x <- stringr::str_remove_all(x, fixed("\"")) # remove double quotation marks x <- x[x != ""] # drop empty strings x <- tokenizers::tokenize_sentences(x) x <- unlist(x) return(x)}# Apply to all textsns1_sen <- cleanText(ns1); ns2_sen <- cleanText(ns2)de_sen <- cleanText(de); es_sen <- cleanText(es)fr_sen <- cleanText(fr); it_sen <- cleanText(it)pl_sen <- cleanText(pl); ru_sen <- cleanText(ru)``````{r split2, echo=FALSE, message=FALSE, warning=FALSE}ru_sen |> as.data.frame() |> head(5) |> flextable::flextable() |> flextable::set_table_properties(width = .99, layout = "autofit") |> flextable::theme_zebra() |> flextable::fontsize(size = 10) |> flextable::set_caption(caption = "First 5 sentences from the Russian learner data after splitting.") |> flextable::border_outer()```## Computing Sentence Lengths {-}`tokenizers::count_words()` returns the number of whitespace-delimited tokens in each sentence. 
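A minimal, self-contained illustration of its behaviour (the two sentences below are invented for demonstration purposes):

```{r senl_demo, message=FALSE, warning=FALSE}
# count_words() returns one word count per input element
tokenizers::count_words(c(
  "Public transport should be free.",
  "However, many people still prefer to travel by car."
))
# returns one count per sentence: 5 and 9
```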
We collect these counts into a single data frame with an `l1` column identifying the speaker group.```{r senl5, message=FALSE, warning=FALSE}sl_df <- data.frame( sentenceLength = c( tokenizers::count_words(ns1_sen), tokenizers::count_words(ns2_sen), tokenizers::count_words(de_sen), tokenizers::count_words(es_sen), tokenizers::count_words(fr_sen), tokenizers::count_words(it_sen), tokenizers::count_words(pl_sen), tokenizers::count_words(ru_sen) ), l1 = c( rep("en", tokenizers::count_words(ns1_sen) |> length()), rep("en", tokenizers::count_words(ns2_sen) |> length()), rep("de", tokenizers::count_words(de_sen) |> length()), rep("es", tokenizers::count_words(es_sen) |> length()), rep("fr", tokenizers::count_words(fr_sen) |> length()), rep("it", tokenizers::count_words(it_sen) |> length()), rep("pl", tokenizers::count_words(pl_sen) |> length()), rep("ru", tokenizers::count_words(ru_sen) |> length()) ))head(sl_df)```## Visualising Sentence Length Distributions {-}Box plots are ideal for comparing distributions across groups: they display the median, interquartile range, and outliers simultaneously, making differences in central tendency and spread immediately visible.```{r senl8, message=FALSE, warning=FALSE, fig.width=9, fig.height=5}l1_labels <- c("en" = "English", "de" = "German", "es" = "Spanish", "fr" = "French", "it" = "Italian", "pl" = "Polish", "ru" = "Russian")sl_df |> ggplot(aes(x = reorder(l1, -sentenceLength, mean), y = sentenceLength, fill = l1)) + geom_boxplot(outlier.alpha = 0.3, outlier.size = 1) + scale_x_discrete("L1 background", labels = l1_labels) + scale_fill_brewer(palette = "Set2", guide = "none") + theme_bw() + labs( title = "Sentence Length Distributions by L1 Background", subtitle = "L1 groups ordered by mean sentence length (descending)", y = "Sentence length (words)" )```The plot reveals considerable variation both within and across groups. L1 English speakers tend to produce longer sentences on average than most learner groups, which may reflect greater syntactic command of subordination and embedding. However, the wide interquartile ranges and overlapping distributions remind us that sentence length alone is a noisy and indirect indicator of proficiency — a very short sentence may be stylistically deliberate, and a very long sentence may be run-on or poorly constructed.---::: {.callout-tip}## Check Your Understanding: Sentence Length:::**Q4. A researcher finds that Polish learners produce significantly shorter sentences than L1 English speakers on average. Which of the following would be the most appropriate immediate follow-up analysis?**```{r}#| echo: false#| label: "LLR_Q4"check_question("Examine the syntactic structure of Polish learner sentences (e.g., rates of subordinate clauses) to determine whether shorter sentences reflect less syntactic complexity or a different discourse style",options =c("Examine the syntactic structure of Polish learner sentences (e.g., rates of subordinate clauses) to determine whether shorter sentences reflect less syntactic complexity or a different discourse style","Conclude that Polish learners have lower English proficiency than other groups, since sentence length is a direct measure of proficiency","Remove all sentences shorter than 5 words from the Polish data to eliminate noise before re-running the analysis","Replace sentence length with word length as the complexity measure, since word length is more reliable" ),type ="radio",q_id ="LLR_Q4",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! 
Sentence length is a useful but indirect and ambiguous measure. Shorter sentences could reflect less syntactic complexity (fewer subordinate clauses, less coordination), but they could also reflect a deliberate preference for short, punchy sentences, more frequent use of rhetorical questions, or a writing style common in Polish academic prose that transfers to L2 English. The appropriate next step is to examine the internal structure of sentences — using POS tags or dependency parses — to determine what is actually shorter or simpler about the Polish learners' sentences.",wrong ="Be careful about treating surface-level measures as direct proxies for proficiency. What else could cause shorter sentences, other than lower proficiency?")```---# N-gram Analysis {#ngrams}::: {.callout-note}## Section Overview**What you'll learn:** How to extract word bigrams from corpus texts, how to normalise their frequencies for cross-corpus comparison, and how to identify which bigrams are used significantly more or less often by learners compared to L1 speakers using Fisher's exact test with Bonferroni correction**Key functions:** `quanteda::tokens_ngrams()`, `fisher.test()`:::**N-grams** are contiguous sequences of *n* words. Bigrams (2-grams) capture word pairs; trigrams (3-grams) capture three-word sequences; and so on. N-gram analysis is used in learner corpus research to identify *formulaic sequences* — multi-word units that learners may use differently from native speakers. Over-use or under-use of certain bigrams can indicate L1 transfer effects, limited access to target-language formulaic sequences, or avoidance of specific collocational patterns.## Extracting Bigrams {-}We tokenise each sentence set, apply `tokens_ngrams(n = 2)`, and then build a unified data frame with an `l1` column and a binary `learner` column.```{r ng1, message=FALSE, warning=FALSE}# Tokenise (lowercase, punctuation removed)tok_list <- list( ns1 = quanteda::tokens(tolower(ns1_sen), remove_punct = TRUE), ns2 = quanteda::tokens(tolower(ns2_sen), remove_punct = TRUE), de = quanteda::tokens(tolower(de_sen), remove_punct = TRUE), es = quanteda::tokens(tolower(es_sen), remove_punct = TRUE), fr = quanteda::tokens(tolower(fr_sen), remove_punct = TRUE), it = quanteda::tokens(tolower(it_sen), remove_punct = TRUE), pl = quanteda::tokens(tolower(pl_sen), remove_punct = TRUE), ru = quanteda::tokens(tolower(ru_sen), remove_punct = TRUE))# Extract bigrams for each groupbigram_list <- lapply(tok_list, function(x) as.vector(unlist(quanteda::tokens_ngrams(x, n = 2))))# Inspect a samplehead(bigram_list$ns1, 10)``````{r ng3, message=FALSE, warning=FALSE}# Build unified data framengram_df <- data.frame( ngram = unlist(bigram_list), l1 = rep(names(bigram_list), sapply(bigram_list, length)), learner = ifelse(rep(names(bigram_list), sapply(bigram_list, length)) == "ns1" | rep(names(bigram_list), sapply(bigram_list, length)) == "ns2", "no", "yes"))head(ngram_df)```## Normalising Frequencies {-}Because the L1 and learner sub-corpora differ in total word count, we cannot compare raw bigram frequencies directly. 
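A toy example (with invented counts) shows why raw frequencies mislead when corpus sizes differ:

```{r ng_norm_toy, message=FALSE, warning=FALSE}
# Invented counts, for illustration only: the same raw count of a bigram
# corresponds to very different relative frequencies once corpus size is taken into account
l1_count      <- 40     # bigram occurs 40 times in the L1 data
learner_count <- 40     # and 40 times in the learner data
l1_total      <- 20000  # total bigrams produced by L1 writers
learner_total <- 60000  # total bigrams produced by learners

c(l1      = l1_count / l1_total * 1000,            # 2.0 per 1,000
  learner = learner_count / learner_total * 1000)  # about 0.67 per 1,000
```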
We normalise to **per-1,000-word** rates by dividing each group's bigram count by the total number of bigrams produced by that group, then multiplying by 1,000.```{r ng5, message=FALSE, warning=FALSE}# Count bigrams per learner/L1 groupngram_freq <- ngram_df |> dplyr::group_by(ngram, learner) |> dplyr::summarise(freq = dplyr::n(), .groups = "drop") |> dplyr::arrange(desc(freq))# Add normalised frequenciesngram_norm <- ngram_freq |> dplyr::group_by(ngram) |> dplyr::mutate(total_ngram = sum(freq)) |> dplyr::group_by(learner) |> dplyr::mutate( total_group = sum(freq), rfreq = freq / total_group * 1000 ) |> dplyr::ungroup()# Spread to wide format (one column per group), fill missing with 0ngram_wide <- ngram_norm |> dplyr::select(ngram, learner, rfreq, total_ngram) |> tidyr::pivot_wider(names_from = learner, values_from = rfreq, values_fill = 0) |> dplyr::arrange(desc(total_ngram))head(ngram_wide, 10)```## Visualising the Most Frequent Bigrams {-}```{r ng10, message=FALSE, warning=FALSE, fig.width=10, fig.height=5}ngram_norm |> dplyr::select(ngram, learner, rfreq, total_ngram) |> tidyr::pivot_wider(names_from = learner, values_from = rfreq, values_fill = 0) |> dplyr::arrange(desc(total_ngram)) |> head(12) |> tidyr::pivot_longer(cols = c(no, yes), names_to = "learner", values_to = "rfreq") |> dplyr::mutate(learner = dplyr::recode(learner, "no" = "L1 speakers", "yes" = "Learners")) |> ggplot(aes(x = reorder(ngram, -total_ngram), y = rfreq, fill = learner)) + geom_col(position = position_dodge(width = 0.8), width = 0.7) + scale_fill_manual(values = c("L1 speakers" = "#2166AC", "Learners" = "#D6604D")) + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9), legend.position = "top") + labs( title = "Top 12 Bigrams: Normalised Frequency per 1,000 Words", subtitle = "Comparing L1 English speakers (LOCNESS) and EFL learners (ICLE)", x = "Bigram", y = "Frequency per 1,000 words", fill = NULL )```## Testing for Significant Differences {-}Visual comparison is informative but we need a statistical test to determine which differences are unlikely to be due to chance. We use **Fisher's exact test**, which is appropriate for count data in 2×2 contingency tables. For each bigram, the table compares: (a) its count in L1 data vs. (b) all other bigrams in L1 data, against (c) its count in learner data vs. 
(d) all other bigrams in learner data.Because we run one test per bigram (potentially thousands of tests), we apply a **Bonferroni correction**: the critical p-value is divided by the number of tests, making the threshold much more stringent and controlling the family-wise error rate.```{r ang1, message=FALSE, warning=FALSE}# Reshape for Fisher's test: one row per bigram, counts for each groupfisher_df <- ngram_freq |> tidyr::pivot_wider(names_from = learner, values_from = freq, values_fill = 0) |> dplyr::rename(l1speaker = no, learner = yes) |> dplyr::ungroup() |> dplyr::mutate( total_l1 = sum(l1speaker), total_learner = sum(learner), a = l1speaker, b = learner, c = total_l1 - l1speaker, d = total_learner - learner )``````{r ang3, message=FALSE, warning=FALSE}# Apply Fisher's exact test row-wise; add Bonferroni correctionfisher_results <- fisher_df |> dplyr::rowwise() |> dplyr::mutate( fisher_p = fisher.test(matrix(c(a, c, b, d), nrow = 2))$p.value, odds_ratio = fisher.test(matrix(c(a, c, b, d), nrow = 2))$estimate, crit = 0.05 / nrow(fisher_df), sig_corr = ifelse(fisher_p < crit, "p < .05*", "n.s.") ) |> dplyr::ungroup() |> dplyr::arrange(fisher_p) |> dplyr::select(ngram, l1speaker, learner, fisher_p, odds_ratio, sig_corr)head(fisher_results, 10)``````{r ang4, message=FALSE, warning=FALSE}# How many bigrams reach significance after Bonferroni correction?table(fisher_results$sig_corr)```::: {.callout-note}## Interpreting Fisher's Exact Test ResultsA significant result (p < Bonferroni-corrected threshold) means that the observed difference in bigram frequency between learners and L1 speakers is unlikely to have arisen by chance, given the corpus sizes. The **odds ratio** indicates the direction: an odds ratio > 1 means the bigram is proportionally more common in L1 speech; an odds ratio < 1 means it is proportionally more common in learner speech.In small sub-corpora like the present data, it is common to find few or no significant results after Bonferroni correction, because the test is very conservative when the number of comparisons is large. Larger corpora or a less conservative correction (e.g. the Benjamini-Hochberg false discovery rate procedure) would typically reveal more significant differences.:::---::: {.callout-tip}## Check Your Understanding: N-gram Analysis:::**Q5. Why is Bonferroni correction necessary when testing bigram frequency differences across an entire frequency list?**```{r}#| echo: false#| label: "LLR_Q5"check_question("Running many tests simultaneously inflates the probability of finding at least one false positive by chance; Bonferroni correction adjusts the critical p-value downward to control the family-wise error rate",options =c("Running many tests simultaneously inflates the probability of finding at least one false positive by chance; Bonferroni correction adjusts the critical p-value downward to control the family-wise error rate","Bonferroni correction increases statistical power by pooling information across all tests","It is required to convert Fisher's exact test p-values into odds ratios","Without correction, the odds ratios would be biased toward values greater than 1" ),type ="radio",q_id ="LLR_Q5",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! If you run 1,000 independent tests at α = .05, you expect approximately 50 false positives (significant results that are actually due to chance) even when there are no real differences. 
Bonferroni correction divides the critical threshold by the number of tests (α / n_tests), making each individual test much more stringent. The downside is reduced power — the corrected threshold may be so strict that genuine differences fail to reach significance in small corpora. This is the classic precision-recall trade-off in multiple testing.",wrong ="Think about what happens to the probability of at least one false positive when you run many tests. If each test has a 5% chance of a false positive, and you run 100 tests, what is the probability of getting at least one?")```---# Collocations and Collocation Networks {#collocations}::: {.callout-note}## Section Overview**What you'll learn:** How to identify statistically significant word collocations using log-likelihood, how to build a feature co-occurrence matrix, and how to visualise collocational relationships as a network graph**Key functions:** `quanteda.textstats::textstat_collocations()`, `quanteda::fcm()`, `quanteda.textplots::textplot_network()`:::A **collocation** is a pair (or larger group) of words that co-occur more frequently than chance would predict. Identifying collocations is important in learner language research because learners often struggle with the conventional collocational patterns of the target language — they may produce grammatically correct but collocationally unusual combinations (e.g. *make homework* instead of *do homework*) that mark their output as non-native [@nesselhauf2005collocations].## Identifying Collocations {-}We use `quanteda.textstats::textstat_collocations()`, which computes the **lambda** statistic (a log-likelihood-based association measure) for all word pairs appearing at least `min_count` times. Higher lambda values indicate stronger association — the pair co-occurs much more frequently than expected by chance.```{r coll5, message=FALSE, warning=FALSE}# Combine L1 sentences and tokenisens_sen <- c(ns1_sen, ns2_sen)ns_tokens <- quanteda::tokens(tolower(ns_sen), remove_punct = TRUE)# Identify collocations occurring at least 20 timesns_coll <- quanteda.textstats::textstat_collocations(ns_tokens, size = 2, min_count = 20)``````{r coll6, echo=FALSE, message=FALSE, warning=FALSE}ns_coll |> head(10) |> as.data.frame() |> flextable::flextable() |> flextable::set_table_properties(width = .75, layout = "autofit") |> flextable::theme_zebra() |> flextable::fontsize(size = 10) |> flextable::set_caption( caption = "Top 10 collocations in the L1 English (LOCNESS) data, ranked by lambda (association strength)." ) |> flextable::border_outer()```The strongest collocations in this transport-themed corpus include named entities and set phrases specific to the topic. This is typical for domain-specific corpora: the most frequent collocations tend to be topically coherent multi-word units.## Building a Collocation Network {-}A **collocation network** visualises the collocational relationships around a target term as a graph, where nodes are words and edges represent co-occurrence strength. This makes it easy to see at a glance which words cluster together around a focal term.::: {.callout-warning}## External Script DependencyThe code below uses the `calculateCoocStatistics()` function, which is not part of any CRAN package. It is available as a standalone R script (`rscripts/calculateCoocStatistics.R`) in the LADAL repository. Download it from the [LADAL GitHub repository](https://github.com/SLCLADAL) and place it in a sub-folder called `rscripts/` within your R project before running this section. 
The function calculates log-likelihood-based co-occurrence statistics between a target term and all other terms in a Document-Feature Matrix.:::```{r dfm1, message=FALSE, warning=FALSE}# Build a DFM from the L1 sentences (stopwords removed)ns_dfm <- quanteda::dfm( quanteda::tokens(ns_sen, remove_punct = TRUE)) |> quanteda::dfm_remove(pattern = stopwords("english"))``````{r dfm3, message=FALSE, warning=FALSE}# Load the co-occurrence statistics functionsource("rscripts/calculateCoocStatistics.R")# Compute log-likelihood co-occurrence statistics for the target term "transport"coocTerm <- "transport"coocs <- calculateCoocStatistics(coocTerm, ns_dfm, measure = "LOGLIK")# Inspect the top 10 collocatescoocs[1:10]``````{r dfm4, message=FALSE, warning=FALSE}# Reduce DFM to the top 10 collocates plus the target wordredux_dfm <- quanteda::dfm_select( ns_dfm, pattern = c(names(coocs)[1:10], coocTerm))# Convert to Feature Co-occurrence Matrix (FCM)# FCM[i,j] = number of documents in which word i and word j both appeartag_fcm <- quanteda::fcm(redux_dfm)``````{r dfm8, message=FALSE, warning=FALSE, fig.width=7, fig.height=7}# Visualise the collocation networkquanteda.textplots::textplot_network( tag_fcm, min_freq = 1, edge_alpha = 0.4, edge_size = 5, edge_color = "gray75", vertex_labelsize = log(rowSums(tag_fcm) * 12))```In the network, the size of each node's label is proportional to its overall frequency in the DFM (a log-scaled version of the row sums). Edges between nodes represent co-occurrence in the same document. The target term *transport* should appear as the most central node, with the strongest edges connecting to its most frequent and most strongly associated collocates.---::: {.callout-tip}## Check Your Understanding: Collocations:::**Q6. A learner writes "she did a big mistake" instead of the native-speaker form "she made a big mistake". What type of error does this represent, and which corpus analysis method would be most appropriate to study it systematically?**```{r}#| echo: false#| label: "LLR_Q6"check_question("A collocational error (wrong verb choice in a verb-noun collocation); collocation analysis comparing learner and L1 frequencies of verb-noun pairs would identify systematic differences in verb selection",options =c("A collocational error (wrong verb choice in a verb-noun collocation); collocation analysis comparing learner and L1 frequencies of verb-noun pairs would identify systematic differences in verb selection","A spelling error; hunspell spell-checking would detect and count these systematically","A lexical diversity error; TTR analysis would reveal that the learner is using a less varied vocabulary","A sentence length error; the sentence is too short relative to the L1 norm" ),type ="radio",q_id ="LLR_Q6",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! 'Did a mistake' is a classic collocational error: the learner has selected a semantically plausible but collocationally incorrect verb. In English, 'mistake' collocates conventionally with 'make' (make a mistake), not 'do' (do a mistake). This kind of error is very common in learner corpora and is often attributed to L1 transfer (e.g. in many Romance and Slavic languages, the equivalent expression uses a verb meaning 'do/commit' rather than 'make'). 
Collocation analysis — specifically comparing the frequencies of [VERB + mistake] bigrams in learner and L1 data — would allow this pattern to be studied systematically across hundreds or thousands of instances.",wrong ="Focus on what is wrong with the sentence: the grammar is correct (subject + verb + article + adjective + noun), the spelling is correct, and the sentence length is normal. What specifically is unconventional about the word choice?")```---# Part-of-Speech Tagging and POS-Sequence Analysis {#pos}::: {.callout-note}## Section Overview**What you'll learn:** How to automatically assign part-of-speech tags to corpus texts using `udpipe`, how to extract POS-tag bigrams, how to compare their frequencies between learners and L1 speakers, and how to use KWIC to inspect the actual words behind significant differences**Key package:** `udpipe`:::**Part-of-speech (POS) tagging** is the automatic assignment of grammatical category labels (noun, verb, adjective, etc.) to each token in a text. Comparing POS-tag sequences between learner and native-speaker texts reveals grammatical differences that are invisible at the word level — for example, differences in the rate of adjective use, the frequency of passive constructions, or the distribution of subordinating conjunctions. POS-based analysis is particularly powerful for studying grammatical complexity and for identifying constructions that transfer from the learner's L1.::: {.callout-warning}## Required: udpipe Language ModelThis section requires a pre-trained `udpipe` language model for English. If you have not already downloaded it, run the following code once:```{r udpipe_download, eval=FALSE, echo=TRUE}m_eng <- udpipe::udpipe_download_model(language = "english-ewt")```This downloads the English EWT (English Web Treebank) model to your working directory. Subsequent sessions should load it directly using `udpipe_load_model()` with the path to the downloaded `.udpipe` file, as shown below. The file is approximately 16 MB. Do **not** re-download it in every session — the download creates a local copy that persists between sessions.:::## Testing POS Tagging on a Sample Sentence {-}Before tagging the full corpus, we test the tagger on a single sentence to inspect the output format and verify that tagging is working as expected.```{r pos2, message=FALSE, warning=FALSE}# Load the pre-downloaded English EWT modelm_eng <- udpipe::udpipe_load_model( file = here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe"))# Tag a test sentencetest_sentence <- "It is now a very wide-spread opinion that in the modern world there is no place for dreaming and imagination."tagged_test <- udpipe::udpipe_annotate(m_eng, x = test_sentence) |> as.data.frame() |> dplyr::select(token, upos, xpos, dep_rel)head(tagged_test, 12)```The output contains several annotation columns. `upos` is the Universal POS tag (a coarse, cross-linguistically consistent tag set: NOUN, VERB, ADJ, etc.). `xpos` is the Penn Treebank tag (a finer-grained English-specific tag set: NN, VBZ, JJ, etc.). `dep_rel` is the dependency relation label. 
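As a quick check of the annotation, we can tabulate the coarse `upos` categories assigned to the test sentence (a small sketch reusing the `tagged_test` object created above):

```{r pos_upos_tab, message=FALSE, warning=FALSE}
# Frequency of Universal POS tags in the tagged test sentence
tagged_test |>
  dplyr::count(upos, sort = TRUE)
```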
For our bigram analysis we use `xpos` tags, in which adjectives are tagged `JJ`, present-tense verbs `VBZ`, personal pronouns `PRP`, and so on.We create a tagged version of the text by concatenating each token with its `xpos` tag, separated by `/`:```{r pos3, message=FALSE, warning=FALSE}tagged_str <- paste0(tagged_test$token, "/", tagged_test$xpos, collapse = " ")tagged_str```## Tagging All Texts {-}We write a function `comText()` that cleans a text, runs `udpipe` annotation, and returns the token/tag string. We then apply it to all eight text sets.```{r pos4, message=FALSE, warning=FALSE}comText <- function(x) { x <- paste0(x, collapse = " ") x <- stringr::str_remove_all(x, "<.*?>") # remove markup tags x <- stringr::str_remove_all(x, fixed("\"")) # remove quotation marks x <- stringr::str_squish(x) x <- x[x != ""] annotated <- udpipe::udpipe_annotate(m_eng, x = x) |> as.data.frame() paste0(annotated$token, "/", annotated$xpos, collapse = " ")}# Apply to all texts (this step takes a few minutes per text)ns1_pos <- comText(ns1_sen); ns2_pos <- comText(ns2_sen)de_pos <- comText(de_sen); es_pos <- comText(es_sen)fr_pos <- comText(fr_sen); it_pos <- comText(it_sen)pl_pos <- comText(pl_sen); ru_pos <- comText(ru_sen)# Preview the first 300 characters of the L1 tagged textsubstr(ns1_pos, 1, 300)```## Extracting POS-Tag Bigrams {-}We extract bigrams of POS *tags only* (stripping the word tokens) and tabulate their frequencies for learners and L1 speakers.```{r pa1, message=FALSE, warning=FALSE}# Function: strip word tokens, keep only POS tags, extract tag bigramsposngram <- function(x) { x |> stringr::str_remove_all("\\w*/") |> # remove "word/" prefix from each token quanteda::tokens(remove_punct = TRUE) |> quanteda::tokens_ngrams(n = 2)}# Apply and unlistns1_posng <- as.vector(unlist(posngram(ns1_pos)))ns2_posng <- as.vector(unlist(posngram(ns2_pos)))de_posng <- as.vector(unlist(posngram(de_pos)))es_posng <- as.vector(unlist(posngram(es_pos)))fr_posng <- as.vector(unlist(posngram(fr_pos)))it_posng <- as.vector(unlist(posngram(it_pos)))pl_posng <- as.vector(unlist(posngram(pl_pos)))ru_posng <- as.vector(unlist(posngram(ru_pos)))head(ns1_posng, 8)``````{r pa5, message=FALSE, warning=FALSE}# Build unified table with learner/L1 labelsposngram_df <- data.frame( ngram = c(ns1_posng, ns2_posng, de_posng, es_posng, fr_posng, it_posng, pl_posng, ru_posng), l1 = c(rep("en", length(ns1_posng)), rep("en", length(ns2_posng)), rep("de", length(de_posng)), rep("es", length(es_posng)), rep("fr", length(fr_posng)), rep("it", length(it_posng)), rep("pl", length(pl_posng)), rep("ru", length(ru_posng))), learner = ifelse( c(rep("en", length(ns1_posng) + length(ns2_posng)), rep("l2", length(de_posng) + length(es_posng) + length(fr_posng) + length(it_posng) + length(pl_posng) + length(ru_posng))) == "en", "no", "yes")) |> dplyr::group_by(ngram, learner) |> dplyr::summarise(freq = dplyr::n(), .groups = "drop") |> dplyr::arrange(desc(freq))head(posngram_df, 8)```## Testing for Significant POS-Sequence Differences {-}We apply the same Fisher's exact test + Bonferroni correction approach as in the N-gram section, but now operating on POS-tag bigrams rather than word bigrams.```{r pa6, message=FALSE, warning=FALSE}posng_fisher <- posngram_df |> tidyr::pivot_wider(names_from = learner, values_from = freq, values_fill = 0) |> dplyr::rename(l1speaker = no, learner = yes) |> dplyr::ungroup() |> dplyr::mutate( total_l1 = sum(l1speaker), total_learner = sum(learner), a = l1speaker, b = learner, c = total_l1 - l1speaker, d = 
total_learner - learner )bonferroni_crit <- 0.05 / nrow(posng_fisher)posng_fisher <- posng_fisher |> dplyr::rowwise() |> dplyr::mutate( fisher_p = fisher.test(matrix(c(a, c, b, d), nrow = 2))$p.value, odds_ratio = fisher.test(matrix(c(a, c, b, d), nrow = 2))$estimate, sig_corr = ifelse(fisher_p < bonferroni_crit, "p < .05*", "n.s.") ) |> dplyr::ungroup() |> dplyr::arrange(fisher_p) |> dplyr::select(ngram, l1speaker, learner, fisher_p, odds_ratio, sig_corr)head(posng_fisher, 10)```## Inspecting Significant POS Sequences with KWIC {-}When a POS-tag bigram differs significantly between groups, the next step is to inspect the actual word tokens that realise that sequence using KWIC. For example, the sequence `PRP_VBZ` (personal pronoun + third-person singular present verb, as in *she runs*, *it matters*) can be searched directly in the tagged text.```{r pa9, message=FALSE, warning=FALSE}# Combine L1 and L2 tagged textsl1_pos <- c(ns1_pos, ns2_pos)l2_pos <- c(de_pos, es_pos, fr_pos, it_pos, pl_pos, ru_pos)# KWIC for PRP_VBZ in L1 dataPRP_VBZ_l1 <- quanteda::kwic( quanteda::tokens(l1_pos), pattern = phrase("\\w*/PRP \\w*/VBZ"), valuetype = "regex", window = 8) |> as.data.frame() |> dplyr::select(-from, -to, -docname, -pattern)head(PRP_VBZ_l1, 6)``````{r pa10, message=FALSE, warning=FALSE}# KWIC for PRP_VBZ in learner dataPRP_VBZ_l2 <- quanteda::kwic( quanteda::tokens(l2_pos), pattern = phrase("\\w*/PRP \\w*/VBZ"), valuetype = "regex", window = 8) |> as.data.frame() |> dplyr::select(-from, -to, -docname, -pattern)head(PRP_VBZ_l2, 6)```Comparing the two concordances reveals whether the same syntactic pattern is associated with different lexical choices in learner and L1 production — for instance, whether learners favour different pronoun–verb combinations or use the pattern in different pragmatic contexts.---::: {.callout-tip}## Check Your Understanding: POS Tagging:::**Q7. Why do we strip the word tokens (e.g. convert `it/PRP` to just `PRP`) before extracting POS-tag bigrams in the sequence analysis?**```{r}#| echo: false#| label: "LLR_Q7"check_question("Because we want to compare abstract grammatical patterns (sequences of word classes) rather than specific word combinations — removing the tokens makes every instance of the same POS sequence comparable regardless of which words realise it",options =c("Because we want to compare abstract grammatical patterns (sequences of word classes) rather than specific word combinations — removing the tokens makes every instance of the same POS sequence comparable regardless of which words realise it","Because udpipe tags contain encoding errors that cause problems if the word form is retained","Because including word tokens would make the bigrams too long to store efficiently in memory","Because the POS tags are on the left side of the / separator and the function only reads the left side" ),type ="radio",q_id ="LLR_Q7",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! The goal of POS-sequence analysis is to identify differences in grammatical structure — how often learners use patterns like [determiner + noun], [pronoun + copula], or [auxiliary + participle] compared to L1 speakers. If we kept the word tokens, every bigram would be a unique word-pair (e.g. 'it/PRP is/VBZ', 'she/PRP runs/VBZ', 'he/PRP thinks/VBZ' would all be different bigrams), and we would need a much larger corpus to get reliable frequency estimates for each specific combination. 
By stripping to POS tags only, all these instances collapse into a single PRP_VBZ bigram, which is much more frequent and therefore more amenable to statistical comparison.",
wrong ="Think about what the analysis is trying to measure: is it about which specific words learners use, or about the grammatical patterns (word class sequences) they prefer?")
```

---

# Lexical Diversity and Readability {#lexdiv}

::: {.callout-note}
## Section Overview

**What you'll learn:** How to compute multiple lexical diversity measures (TTR, CTTR, Herdan's C, Guiraud's R, Maas) and Flesch readability scores for each essay, and how to compare these measures across L1 groups

**Key package:** `quanteda.textstats`
:::

## Lexical Diversity {-}

**Lexical diversity** refers to the range and variety of vocabulary used in a text. It is considered an important indicator of L2 proficiency: more proficient learners tend to deploy a wider range of word types and rely less heavily on a small set of high-frequency forms [@daller2003assessing]. Numerous measures of lexical diversity have been proposed, each capturing a slightly different aspect of vocabulary range:

| Measure | Formula | Notes |
|---------|---------|-------|
| TTR | V / N | Types / Tokens; decreases with text length |
| CTTR | V / √(2N) | Carroll's Corrected TTR; partially controls for length |
| C (Herdan) | log(V) / log(N) | LogTTR; more robust to length than TTR |
| R (Guiraud) | V / √N | Root TTR |
| U (Dugast) | log²(N) / (log(N) − log(V)) | Uber index |
| Maas | (log(N) − log(V)) / log²(N) | Inverse measure; lower = more diverse |

: Lexical diversity measures {tbl-colwidths="[15,35,50]"}

### Splitting Texts into Individual Essays {-}

Before computing lexical diversity, we split each text file into individual essays. In the ICLE/LOCNESS data, essays are separated by headers of the form *Transport 01*, *Transport 02*, etc.

```{r ld1, message=FALSE, warning=FALSE}
cleanEss <- function(x) {
  x |>
    paste0(collapse = " ") |>
    stringr::str_split("Transport [0-9]{1,2}") |>
    unlist() |>
    stringr::str_squish() |>
    (\(v) v[v != ""])()
}
ns1_ess <- cleanEss(ns1); ns2_ess <- cleanEss(ns2)
de_ess <- cleanEss(de); es_ess <- cleanEss(es)
fr_ess <- cleanEss(fr); it_ess <- cleanEss(it)
pl_ess <- cleanEss(pl); ru_ess <- cleanEss(ru)
# Preview the first essay from the L1 data
substr(ns1_ess[1], 1, 400)
```

### Computing Lexical Diversity Measures {-}

We use `quanteda.textstats::textstat_lexdiv()`, which computes multiple lexical diversity measures simultaneously for every document in a document-feature matrix. (Its `MSTTR_segment` and `MATTR_window` arguments control the segment and window sizes for the moving-window measures MSTTR and MATTR, which we do not use here.)

```{r ld2, message=FALSE, warning=FALSE}
# Build a DFM from individual essays, one document per essay
all_ess <- c(ns1_ess, ns2_ess, de_ess, es_ess, fr_ess, it_ess, pl_ess, ru_ess)
ess_dfm <- quanteda::dfm(
  quanteda::tokens(all_ess, remove_punct = TRUE)
)
# Compute lexical diversity measures
ld_all <- quanteda.textstats::textstat_lexdiv(
  ess_dfm,
  measure = c("TTR", "C", "R", "CTTR", "U", "Maas")
)
# Add L1 labels
ld_all$l1 <- c(
  rep("en", length(ns1_ess) + length(ns2_ess)),
  rep("de", length(de_ess)), rep("es", length(es_ess)),
  rep("fr", length(fr_ess)), rep("it", length(it_ess)),
  rep("pl", length(pl_ess)), rep("ru", length(ru_ess))
)
head(ld_all)
```

### Extracting and Visualising CTTR {-}

We focus on **Carroll's Corrected TTR (CTTR)** as a representative lexical diversity measure.
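Its formula, CTTR = V / √(2N), is easy to verify by hand; the tiny example below uses an invented seven-token "text" purely for illustration:

```{r ld_cttr_hand, message=FALSE, warning=FALSE}
# Hand-computing CTTR for a tiny invented token sequence
toy <- c("cars", "pollute", "cities", "and", "cars", "crowd", "cities")
N <- length(toy)          # 7 tokens
V <- length(unique(toy))  # 5 types
V / sqrt(2 * N)           # CTTR: 5 / sqrt(14), about 1.34
```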
Higher CTTR values indicate greater lexical diversity.```{r ld5, message=FALSE, warning=FALSE}# Extract CTTR values from the textstat_lexdiv output; add L1 labelscttr <- ld_all |> dplyr::select(l1, CTTR)head(cttr)``````{r ld7, message=FALSE, warning=FALSE, fig.width=9, fig.height=5}cttr |> dplyr::group_by(l1) |> dplyr::summarise(CTTR = mean(CTTR, na.rm = TRUE), .groups = "drop") |> ggplot(aes(x = reorder(l1, CTTR), y = CTTR, colour = l1)) + geom_point(size = 5) + scale_x_discrete(labels = l1_labels) + scale_colour_brewer(palette = "Set2", guide = "none") + coord_cartesian(ylim = c(0, 15)) + theme_bw() + labs( title = "Mean Lexical Diversity (CTTR) by L1 Background", subtitle = "Carroll's Corrected Type-Token Ratio; higher = more diverse vocabulary", x = "L1 background", y = "Mean CTTR" )```Since `ld_all` was already constructed with L1 labels in the previous step, the `ld5` chunk simplifies to a straightforward `select()` — no need to re-extract or re-label anything.---## Readability {-}**Readability** measures quantify how easy or difficult a text is to read and comprehend. They are used in learner corpus research as indirect measures of writing quality and complexity. A text written by a highly proficient learner should be neither excessively simple nor unnecessarily complex. Here we focus on the **Flesch Reading Ease score** [@flesch1948new]:$$\text{Flesch} = 206.835 - (1.015 \times \text{ASL}) - \left(84.6 \times \frac{N_{\text{syllables}}}{N_{\text{words}}}\right)$$where ASL is the average sentence length. **Higher scores indicate easier text** (shorter sentences, simpler words); scores below 30 indicate very difficult text, while scores above 70 indicate easy text. This creates a somewhat counter-intuitive situation for complexity research: *higher proficiency does not necessarily mean higher Flesch scores* — proficient writers often produce more complex (lower-scoring) text.```{r rd1, message=FALSE, warning=FALSE}# Compute Flesch scores for each essayns1_read <- quanteda.textstats::textstat_readability(ns1_ess, measure = "Flesch")ns2_read <- quanteda.textstats::textstat_readability(ns2_ess, measure = "Flesch")de_read <- quanteda.textstats::textstat_readability(de_ess, measure = "Flesch")es_read <- quanteda.textstats::textstat_readability(es_ess, measure = "Flesch")fr_read <- quanteda.textstats::textstat_readability(fr_ess, measure = "Flesch")it_read <- quanteda.textstats::textstat_readability(it_ess, measure = "Flesch")pl_read <- quanteda.textstats::textstat_readability(pl_ess, measure = "Flesch")ru_read <- quanteda.textstats::textstat_readability(ru_ess, measure = "Flesch")``````{r rd3, message=FALSE, warning=FALSE}# Combine into one table with L1 labelsread_all <- base::rbind( ns1_read, ns2_read, de_read, es_read, fr_read, it_read, pl_read, ru_read) |> dplyr::mutate( l1 = c( rep("en", nrow(ns1_read) + nrow(ns2_read)), rep("de", nrow(de_read)), rep("es", nrow(es_read)), rep("fr", nrow(fr_read)), rep("it", nrow(it_read)), rep("pl", nrow(pl_read)), rep("ru", nrow(ru_read)) ), l1 = factor(l1, levels = c("en", "de", "es", "fr", "it", "pl", "ru")) ) |> dplyr::group_by(l1) |> dplyr::summarise(Flesch = mean(Flesch, na.rm = TRUE), .groups = "drop")read_all``````{r rd4, message=FALSE, warning=FALSE, fig.width=9, fig.height=5}read_all |> ggplot(aes(x = l1, y = Flesch, label = round(Flesch, 1), fill = l1)) + geom_col(width = 0.7) + geom_text(vjust = 1.5, colour = "white", size = 4) + scale_x_discrete(labels = l1_labels) + scale_fill_brewer(palette = "Set2", guide = "none") + 
coord_cartesian(ylim = c(0, 80)) + theme_bw() + labs( title = "Mean Flesch Reading Ease Score by L1 Background", subtitle = "Higher scores = easier/simpler text; lower scores = more complex text", x = "L1 background", y = "Mean Flesch Reading Ease" )```---::: {.callout-tip}## Check Your Understanding: Lexical Diversity and Readability:::**Q8. A researcher finds that Russian learners have the highest CTTR scores of all learner groups. Which interpretation is most warranted?**```{r}#| echo: false#| label: "LLR_Q8"check_question("Russian learners use a more varied vocabulary per essay than other learner groups in this sample, though this could reflect essay length, topic treatment, or L1 transfer rather than proficiency alone",options =c("Russian learners use a more varied vocabulary per essay than other learner groups in this sample, though this could reflect essay length, topic treatment, or L1 transfer rather than proficiency alone","Russian is the most similar language to English, so transfer effects boost vocabulary diversity","Russian learners are definitively more proficient than all other learner groups","The CTTR scores prove that the Russian ICLE sub-corpus contains more essays than the other sub-corpora" ),type ="radio",q_id ="LLR_Q8",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! A higher CTTR value does mean greater measured lexical diversity within this sample, but the cause and interpretation require caution. CTTR (like all TTR-based measures) is sensitive to text length — longer essays tend to have lower TTR because repetition increases with length, so if Russian essays happen to be shorter, CTTR would be artificially inflated. The result could also reflect differences in task interpretation, L1 literacy practices that transfer to L2 writing, or genuine vocabulary breadth. Proficiency is one possible explanation but should not be the only one considered.",wrong ="A higher CTTR does indicate greater measured lexical diversity, but is that the same as definitively greater proficiency? What else might affect CTTR scores?")```---# Spelling Errors {#spelling}::: {.callout-note}## Section Overview**What you'll learn:** How to use the `hunspell` package to detect words not found in a standard English dictionary, how to compute a normalised spelling error rate, and how to compare it across L1 groups**Key package:** `hunspell`:::Spelling accuracy is one of the most directly observable dimensions of L2 written proficiency. Although advanced learners generally make fewer spelling errors than beginners, systematic differences in error rates across L1 groups can reveal specific orthographic challenges associated with transfer from particular L1 writing systems. For example, learners whose L1 uses a transparent orthography (consistent grapheme-phoneme mappings) may transfer phonological spelling strategies to English that produce systematic errors (e.g. spelling based on pronunciation rather than convention).The `hunspell` package checks words against a dictionary and returns a list of words not found in it. We use the British English dictionary (`en_GB`) to match the LOCNESS corpus's expected spelling norms.::: {.callout-warning}## Limitations of Dictionary-Based Spell CheckingDictionary-based spell checking has several known limitations as a method for studying learner errors. It flags any word not in the dictionary — including proper nouns, technical vocabulary, abbreviations, and deliberate neologisms — as potential errors. 
It also misses *real-word errors* (e.g. *their* for *there*, *form* for *from*), which are spelled correctly but used in the wrong context. The counts here should therefore be interpreted as *non-dictionary word rates* rather than true spelling error rates.:::```{r sp1, message=FALSE, warning=FALSE}# Inspect non-dictionary words in the first L1 essayhunspell::hunspell(ns1_ess[1], dict = hunspell::dictionary("en_GB")) |> unlist() |> head(20)``````{r sp4, message=FALSE, warning=FALSE}# Function: count non-dictionary words and total words per text setspellStats <- function(texts, dict_code = "en_GB") { n_errors <- hunspell::hunspell(texts, dict = hunspell::dictionary(dict_code)) |> unlist() |> length() n_words <- sum(tokenizers::count_words(texts)) list(errors = n_errors, words = n_words)}# Apply to all text setsspell_data <- list( en_ns1 = spellStats(ns1_ess), en_ns2 = spellStats(ns2_ess), de = spellStats(de_ess), es = spellStats(es_ess), fr = spellStats(fr_ess), it = spellStats(it_ess), pl = spellStats(pl_ess), ru = spellStats(ru_ess))``````{r sp6, message=FALSE, warning=FALSE}# Build summary table: normalised error rate per 1,000 wordserr_tb <- data.frame( l1 = c("en", "en", "de", "es", "fr", "it", "pl", "ru"), errors = sapply(spell_data, `[[`, "errors"), words = sapply(spell_data, `[[`, "words")) |> dplyr::mutate(freq = round(errors / words * 1000, 1)) |> dplyr::group_by(l1) |> dplyr::summarise(freq = mean(freq), .groups = "drop")err_tb``````{r sp7, message=FALSE, warning=FALSE, fig.width=9, fig.height=5}err_tb |> ggplot(aes(x = reorder(l1, -freq), y = freq, label = freq, fill = l1)) + geom_col(width = 0.7) + geom_text(vjust = 1.5, colour = "white", size = 4) + scale_x_discrete(labels = l1_labels) + scale_fill_brewer(palette = "Set2", guide = "none") + coord_cartesian(ylim = c(0, 45)) + theme_bw() + labs( title = "Non-Dictionary Word Rate by L1 Background", subtitle = "Per 1,000 words; British English dictionary (en_GB); includes proper nouns and technical vocabulary", x = "L1 background", y = "Non-dictionary words per 1,000 words" )```---::: {.callout-tip}## Check Your Understanding: Spelling Errors:::**Q9. A researcher uses `hunspell` to count spelling errors in learner essays and finds that L1 English speakers have a higher non-dictionary word rate than some learner groups. How should this be interpreted?**```{r}#| echo: false#| label: "LLR_Q9"check_question("L1 speakers may use more proper nouns, colloquialisms, contractions, or domain-specific vocabulary not present in the standard dictionary — the hunspell count is not a pure spelling error count",options =c("L1 speakers may use more proper nouns, colloquialisms, contractions, or domain-specific vocabulary not present in the standard dictionary — the hunspell count is not a pure spelling error count","This proves that L1 speakers make more spelling mistakes than learners","The British English dictionary is biased against native speakers, so their errors are over-counted","L1 speakers intentionally write non-standard forms to demonstrate their proficiency" ),type ="radio",q_id ="LLR_Q9",random_answer_order =TRUE,button_label ="Check answer",right ="Correct! hunspell flags any word not in its dictionary as a potential error, but a non-dictionary word is not necessarily a misspelling. L1 speakers may freely use proper nouns (place names, people's names), informal contractions, slang, acronyms, and domain-specific vocabulary that are absent from a standard general-English dictionary. 
---

# Summary and Best Practices {#summary}

This tutorial has introduced seven core corpus-based methods for analysing learner language in R, along with a supplementary analysis of spelling accuracy. Working through these methods on the ICLE/LOCNESS data has illustrated both their practical application and their interpretive limits. A few overarching principles are worth emphasising:

**Normalise before comparing.** Raw frequency counts are almost always misleading when comparing across texts or corpora of different sizes. Always divide by total word count (or total token count) and express frequencies per 1,000 or per million words before drawing comparisons.

**Use statistics to confirm visual impressions.** Bar charts and box plots reveal patterns, but they cannot distinguish systematic differences from sampling variation. Fisher's exact test (with Bonferroni correction for multiple comparisons) provides the inferential layer that transforms a visual observation into a defensible claim.

**Triangulate across methods.** No single measure fully captures the complexity of learner language. Sentence length, lexical diversity, POS-sequence frequencies, collocational patterns, readability, and spelling accuracy each illuminate a different dimension. Convergent evidence across multiple measures is more compelling than a single significant result.

**Interpret measures in their context.** Every measure discussed here has known limitations — TTR's sensitivity to text length, Flesch's insensitivity to syntactic complexity, hunspell's conflation of errors with uncommon vocabulary. Always report these caveats alongside your findings.

**Document your pre-processing.** The results of learner corpus analyses are sensitive to cleaning decisions: which tags are stripped, how sentence boundaries are detected, which stopword list is used. Document every step in your code so that others (and your future self) can reproduce and evaluate your choices.

---

# Citation and Session Info {-}

Schweinberger, Martin. 2026. *Analysing Learner Language with R*. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/llr/llr.html (Version 2026.02.24).

```
@manual{schweinberger2026llr,
  author = {Schweinberger, Martin},
  title = {Analysing Learner Language with R},
  note = {https://ladal.edu.au/tutorials/llr/llr.html},
  year = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address = {Brisbane},
  edition = {2026.02.24}
}
```

```{r session_info}
sessionInfo()
```

::: {.callout-note}
## AI Transparency Statement

This tutorial was revised and substantially expanded with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. Claude was used to restructure the document into Quarto format, expand the analytical prose and interpretive sections, write the `checkdown` quiz questions and feedback strings, improve code consistency and style, add section overview callouts and learning objectives, and revise the background and summary sections.
All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material.
:::

---

[Back to top](#intro)

[Back to LADAL home](/)

---

# References {-}