# Introduction

This tutorial introduces how to extract concordances and keyword-in-context (KWIC) displays with R.

This tutorial is aimed at beginners and intermediate users of R. It showcases how to extract keywords and key phrases from textual data and how to process the resulting concordances using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with concordancing.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knit the document to HTML or PDF, you need to make sure that you have R and RStudio installed. You also need to download the bibliography file and store it in the same folder as the Rmd file.

Here is a link to an interactive version of this tutorial on Google Colab. The interactive tutorial is based on a Jupyter notebook of this tutorial. This interactive Jupyter notebook allows you to execute code yourself and - if you copy the Jupyter notebook - you can also change and edit the notebook, e.g. you can change code and upload your own data.

In the language sciences, concordancing refers to the extraction of words from a given text or texts. Commonly, concordances are displayed in the form of keyword-in-context displays (KWICs) where the search term is shown in context, i.e. with preceding and following words. Concordancing is central to analyses of text and often represents the first step in more sophisticated analyses of language data. Concordances play such a key role in the language sciences because they are extremely valuable for understanding how a word or phrase is used, how often it is used, and in which contexts it is used. As concordances allow us to analyze the context in which a word or phrase occurs and provide frequency information about word use, they also enable us to analyze collocations or the collocational profiles of words and phrases. Finally, concordances are very commonly used to extract examples.
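
To make the idea of a keyword-in-context display concrete, here is a minimal sketch in base R. The function name simple_kwic, its arguments, and the example sentence are illustrative assumptions and not part of any package; the remainder of the tutorial uses the kwic function from quanteda instead.

```r
# Toy keyword-in-context function (base R only; all names are illustrative)
simple_kwic <- function(text, keyword, window = 3) {
  # split the text into lower-cased words on white space
  words <- strsplit(tolower(text), "\\s+")[[1]]
  # positions at which the keyword occurs
  hits <- which(words == tolower(keyword))
  # collect the preceding and following words for each hit
  rows <- lapply(hits, function(i) {
    left  <- tail(words[seq_len(i - 1)], window)
    right <- head(words[-seq_len(i)], window)
    data.frame(pre     = paste(left, collapse = " "),
               keyword = words[i],
               post    = paste(right, collapse = " "))
  })
  # combine the single-row data frames into one table
  do.call(rbind, rows)
}

example_text <- "natural selection acts slowly and selection favours the fit"
toy <- simple_kwic(example_text, "selection", window = 2)
toy
```

Each row of toy holds one occurrence of the search term together with up to two words of left and right context, which is the structure that all concordances in this tutorial share.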

There are various very good software packages that can be used to create concordances - both for offline use (e.g. AntConc, SketchEngine, MONOCONC, and ParaConc) and online use (see e.g. here).

In addition, many available corpora, such as the BYU corpora, can be accessed via web interfaces that have built-in concordancing functions.

While these packages are very user-friendly, offer various additional functionalities, and almost everyone who is engaged in analyzing language has used concordance software, they all suffer from shortcomings that render R a viable alternative. Such issues include that these applications

• are black boxes: researchers do not have full control over the software and do not know what is going on inside it

• are not open source

• hinder replication because replication is more time-consuming than for analyses based on Notebooks

• are commonly not free of charge or have other restrictions on use (a notable exception is AntConc)

In contrast, R

• is extremely flexible and enables researchers to perform their entire analysis in a single environment

• allows full transparency and documentation as analyses can be based on Notebooks

• offers version control measures (this means that the specific versions of the involved software are traceable)

• makes research more replicable as entire analyses can be reproduced by simply running the Notebooks that the research is based on

The fact that R enables full transparency and replicability is especially relevant given the ongoing Replication Crisis. The Replication Crisis is an ongoing methodological crisis, beginning in the early 2010s, that primarily affects parts of the social and life sciences (see also Fanelli 2009). Replication is important so that other researchers, or the public for that matter, can see or, indeed, reproduce exactly what you have done. Fortunately, R allows you to document your entire workflow as you can store everything you do in what is called a script or a notebook (in fact, this document was originally an R notebook). If someone is then interested in how you conducted your analysis, you can simply share this notebook or the script you have written with that person.

## Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code. It may take some time (between 1 and 5 minutes) to install all of the packages, so you do not need to worry if it takes a while.

# install packages
install.packages("quanteda")
install.packages("tidyverse")
install.packages("gutenbergr")
install.packages("flextable")
install.packages("plyr")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
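
If you are unsure which of these packages are already installed, a small helper can check this before anything is installed. This is only a sketch: the function name find_missing is an assumption made for illustration, while requireNamespace and install.packages are base R functions.

```r
# return the names of those packages that are not installed (illustrative helper)
find_missing <- function(pkgs) {
  # requireNamespace() returns FALSE if a package cannot be found
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# packages taken from the install commands above
to_install <- find_missing(c("quanteda", "tidyverse", "gutenbergr", "flextable"))
to_install
# uncomment the next line to install only the missing packages
# if (length(to_install) > 0) install.packages(to_install)
```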

Now that we have installed the packages, we activate them as shown below.

# set options
options(stringsAsFactors = FALSE)      # do not convert strings to factors
options("scipen" = 100, "digits" = 12) # suppress scientific notation
# activate packages
library(quanteda)
library(gutenbergr)
library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and also initiated the session by executing the code shown above, you are good to go.

For this tutorial, we will use Charles Darwin’s On the Origin of Species by means of Natural Selection which we download from the Project Gutenberg archive (see Stroube 2003). Thus, Darwin’s Origin of Species forms the basis of our analysis. You can use the code below to download this text into R (but you have to have access to the internet to do so).

origin <- gutenberg_works(
  # define id of darwin's origin in project gutenberg
  gutenberg_id == 1228) %>%
  # download the text from a Project Gutenberg mirror
  gutenbergr::gutenberg_download(mirror = "http://mirrors.xmission.com/gutenberg/") %>%
  # remove empty rows
  dplyr::filter(text != "")

| gutenberg_id | text |
|---|---|
| 1,228 | Click on any of the filenumbers below to quickly view each ebook. |
| 1,228 | 1228 1859, First Edition |
| 1,228 | 22764 1860, Second Edition |
| 1,228 | 2009 1872, Sixth Edition, considered the definitive edition. |
| 1,228 | On |
| 1,228 | the Origin of Species |
| 1,228 | BY MEANS OF NATURAL SELECTION, |
| 1,228 | OR THE |
| 1,228 | PRESERVATION OF FAVOURED RACES IN THE STRUGGLE FOR LIFE. |
| 1,228 | By Charles Darwin, M.A., |

The table above shows that Darwin’s Origin of Species requires formatting so that we can use it. Therefore, we collapse it into a single object (or text) and remove superfluous white spaces.

origin <- origin$text %>%
  # collapse lines into a single text
  paste0(collapse = " ") %>%
  # remove superfluous white spaces
  str_squish()

## Click on any of the filenumbers below to quickly view each ebook. 1228 1859, First Edition 22764 1860, Second Edition 2009 1872, Sixth Edition, considered the definitive edition. On the Origin of Species BY MEANS OF NATURAL SELECTION, OR THE PRESERVATION OF FAVOURED RACES IN THE STRUGGLE FOR LIFE. By Charles Darwin, M.A., Fellow Of The Royal, Geological, Linnæan, Etc., Societies; Author Of ‘Journal Of Researches During H.M.S. Beagle’s Voyage Round The World.’ LONDON: JOHN MURRAY, ALBEMARLE STREET. 1859. “But with regard to the material world, we can at least go so far as this—we can perceive that events are brought about not by insulated interpositions of Divine power, exerted in each particular case, but by the establishment of general laws.” W. WHEWELL: _Bridgewater Treatise_. “To conclude, therefore, let no man out of a weak conceit of sobriety, or an ill-applied moderation, think or maintain, that a man can search too far or be too well studied in the book of God’s word, or in the

The result confirms that the entire text is now combined into a single character object.

## Creating simple concordances

Now that we have loaded the data, we can easily extract concordances using the kwic function from the quanteda package. The kwic function takes the text (x) and the search pattern (pattern) as its main arguments but it also allows the specification of the context window, i.e. how many words/elements are shown to the left and right of the key word (we will go over this later on).

kwic_natural <- kwic(
  # define text
  origin,
  # define search pattern
  pattern = "selection")

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text1 | 44 | 44 | Species BY MEANS OF NATURAL | SELECTION | , OR THE PRESERVATION OF | selection |
| text1 | 275 | 275 | EXISTENCE . 4 . NATURAL | SELECTION | . 5 . LAWS OF | selection |
| text1 | 411 | 411 | and Origin . Principle of | Selection | anciently followed , its Effects | selection |
| text1 | 421 | 421 | Effects . Methodical and Unconscious | Selection | . Unknown Origin of our | selection |
| text1 | 436 | 436 | favourable to Man's power of | Selection | . CHAPTER 2 . VARIATION | selection |
| text1 | 522 | 522 | EXISTENCE . Bears on natural | selection | . The term used in | selection |
| text1 | 616 | 616 | . CHAPTER 4 . NATURAL | SELECTION | . Natural Selection : its | selection |
| text1 | 619 | 619 | . NATURAL SELECTION . Natural | Selection | : its power compared with | selection |
| text1 | 626 | 626 | its power compared with man's | selection | , its power on characters | selection |
| text1 | 647 | 647 | on both sexes . Sexual | Selection | . On the generality of | selection |

You will see that you get a warning stating that you should use tokens() before extracting concordances (more recent versions of quanteda may reject untokenized text altogether). Tokenizing first can be done as shown below. Also, we can specify the package from which we want to use a function by adding the package name plus :: before the function (see below).

kwic_natural <- quanteda::kwic(
  # define and tokenize text
  quanteda::tokens(origin),
  # define search pattern
  pattern = "selection")

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text1 | 44 | 44 | Species BY MEANS OF NATURAL | SELECTION | , OR THE PRESERVATION OF | selection |
| text1 | 275 | 275 | EXISTENCE . 4 . NATURAL | SELECTION | . 5 . LAWS OF | selection |
| text1 | 411 | 411 | and Origin . Principle of | Selection | anciently followed , its Effects | selection |
| text1 | 421 | 421 | Effects . Methodical and Unconscious | Selection | . Unknown Origin of our | selection |
| text1 | 436 | 436 | favourable to Man's power of | Selection | . CHAPTER 2 . VARIATION | selection |
| text1 | 522 | 522 | EXISTENCE . Bears on natural | selection | . The term used in | selection |
| text1 | 616 | 616 | . CHAPTER 4 . NATURAL | SELECTION | . Natural Selection : its | selection |
| text1 | 619 | 619 | . NATURAL SELECTION . Natural | Selection | : its power compared with | selection |
| text1 | 626 | 626 | its power compared with man's | selection | , its power on characters | selection |
| text1 | 647 | 647 | on both sexes . Sexual | Selection | . On the generality of | selection |

We can easily extract the frequency of the search term (selection) using the nrow or the length functions which provide the number of rows of a table (nrow) or the length of a vector (length).

nrow(kwic_natural)
## [1] 412
length(kwic_natural$keyword)
## [1] 412
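
The same steps can be tried without downloading the book. The following minimal sketch runs them on a made-up sentence; the object names (toy_text, toy_toks, and so on) and the sentence itself are assumptions for illustration, while tokens, kwic, and the case_insensitive argument belong to quanteda.

```r
library(quanteda)

# made-up example text (not taken from Darwin)
toy_text <- "Natural Selection is not sexual selection. SELECTION acts slowly."
# tokenize first, then extract concordances
toy_toks <- tokens(toy_text)
# by default, kwic ignores case
kwic_all <- kwic(toy_toks, pattern = "selection")
# match lower-case selection only
kwic_lower <- kwic(toy_toks, pattern = "selection", case_insensitive = FALSE)
# count the hits of both searches
n_all <- nrow(as.data.frame(kwic_all))
n_lower <- nrow(as.data.frame(kwic_lower))
c(all = n_all, lower = n_lower)
```

Setting case_insensitive = FALSE is one way to restrict a search to a single spelling variant instead of tabulating the variants afterwards.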

The results show that there are 412 instances of the search term (selection) but we can also find out how often different variants (lower case versus upper case) of the search term were found using the table function. This is especially useful when searches involve many different search terms (while it is, admittedly, less useful in the present example).

table(kwic_natural$keyword)
##
## selection Selection SELECTION
##       369        39         4

To get a better understanding of the use of a word, it is often useful to extract more context. This is easily done by increasing the size of the context window. To do this, we specify the window argument of the kwic function. In the example below, we set the context window size to 10 words/elements rather than using the default (which is 5 words/elements).

kwic_natural_longer <- kwic(
  # define and tokenize text
  quanteda::tokens(origin),
  # define search pattern
  pattern = "selection",
  # define context window size
  window = 10)

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text1 | 44 | 44 | . On the Origin of Species BY MEANS OF NATURAL | SELECTION | , OR THE PRESERVATION OF FAVOURED RACES IN THE STRUGGLE | selection |
| text1 | 275 | 275 | . 3 . STRUGGLE FOR EXISTENCE . 4 . NATURAL | SELECTION | . 5 . LAWS OF VARIATION . 6 . DIFFICULTIES | selection |
| text1 | 411 | 411 | Domestic Pigeons , their Differences and Origin . Principle of | Selection | anciently followed , its Effects . Methodical and Unconscious Selection | selection |
| text1 | 421 | 421 | Selection anciently followed , its Effects . Methodical and Unconscious | Selection | . Unknown Origin of our Domestic Productions . Circumstances favourable | selection |
| text1 | 436 | 436 | our Domestic Productions . Circumstances favourable to Man's power of | Selection | . CHAPTER 2 . VARIATION UNDER NATURE . Variability . | selection |
| text1 | 522 | 522 | CHAPTER 3 . STRUGGLE FOR EXISTENCE . Bears on natural | selection | . The term used in a wide sense . Geometrical | selection |
| text1 | 616 | 616 | most important of all relations . CHAPTER 4 . NATURAL | SELECTION | . Natural Selection : its power compared with man's selection | selection |
| text1 | 619 | 619 | all relations . CHAPTER 4 . NATURAL SELECTION . Natural | Selection | : its power compared with man's selection , its power | selection |
| text1 | 626 | 626 | SELECTION . Natural Selection : its power compared with man's | selection | , its power on characters of trifling importance , its | selection |
| text1 | 647 | 647 | power at all ages and on both sexes . Sexual | Selection | . On the generality of intercrosses between individuals of the | selection |

EXERCISE TIME!

1. Extract the first 10 concordances for the word nature.

Answer

kwic_nature <- kwic(x = tokens(origin), pattern = "nature")
# inspect
kwic_nature %>%
  as.data.frame() %>%
  head(10)

2. How many instances are there of the word nature?

Answer

kwic_nature %>%
  as.data.frame() %>%
  nrow()
## [1] 261

3. Extract concordances for the word origin and show the first 5 concordance lines.

Answer

kwic_origin <- kwic(x = tokens(origin), pattern = "origin")
# inspect
kwic_origin %>%
  as.data.frame() %>%
  head(5)
##   docname from  to                                 pre keyword
## 1   text1   37  37         definitive edition . On the  Origin
## 2   text1  351 351         DETEAILED CONTENTS . ON THE  ORIGIN
## 3   text1  391 391     between Varieties and Species .  Origin
## 4   text1  407 407     Pigeons , their Differences and  Origin
## 5   text1  424 424 and Unconscious Selection . Unknown  Origin
##                                post pattern
## 1            of Species BY MEANS OF  origin
## 2       OF SPECIES . INTRODUCTION .  origin
## 3    of Domestic Varieties from one  origin
## 4 . Principle of Selection anciently  origin
## 5     of our Domestic Productions .  origin

## Extracting more than single words

While extracting single words is very common, you may want to extract more than just one word. To extract phrases, all you need to do is to specify that the pattern you are looking for is a phrase, as shown below.

kwic_naturalselection <- kwic(tokens(origin), pattern = phrase("natural selection"))

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text1 | 43 | 44 | of Species BY MEANS OF | NATURAL SELECTION | , OR THE PRESERVATION OF | natural selection |
| text1 | 274 | 275 | FOR EXISTENCE . 4 . | NATURAL SELECTION | . 5 . LAWS OF | natural selection |
| text1 | 521 | 522 | FOR EXISTENCE . Bears on | natural selection | . The term used in | natural selection |
| text1 | 615 | 616 | relations . CHAPTER 4 . | NATURAL SELECTION | . Natural Selection : its | natural selection |
| text1 | 618 | 619 | 4 . NATURAL SELECTION . | Natural Selection | : its power compared with | natural selection |
| text1 | 666 | 667 | Circumstances favourable and unfavourable to | Natural Selection | , namely , intercrossing , | natural selection |
| text1 | 685 | 686 | action . Extinction caused by | Natural Selection | . Divergence of Character , | natural selection |
| text1 | 709 | 710 | to naturalisation . Action of | Natural Selection | , through Divergence of Character | natural selection |
| text1 | 753 | 754 | and disuse , combined with | natural selection | ; organs of flight and | natural selection |
| text1 | 925 | 926 | embraced by the theory of | Natural Selection | . CHAPTER 7 . INSTINCT | natural selection |

Of course you can extend this to longer sequences such as entire sentences.
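
Phrase searches and the window argument can be combined. The sketch below does so on a made-up sentence; toy_toks, kwic_phrase, and the sentence are assumptions for illustration, whereas tokens, kwic, and phrase are quanteda functions.

```r
library(quanteda)

# made-up example sentence, tokenized
toy_toks <- tokens("The theory of natural selection explains how natural variation and selection interact.")
# search for the two-word sequence and show two words of context on each side
kwic_phrase <- kwic(toy_toks, pattern = phrase("natural selection"), window = 2)
# convert to a data frame for inspection
phrase_df <- as.data.frame(kwic_phrase)
phrase_df
```

Only the adjacent sequence natural selection is retrieved; the later occurrences of natural and selection are separated by other words and therefore do not count as the phrase.
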
However, you may want to extract more or less concrete patterns rather than words or phrases. To search for patterns rather than words, you need to include regular expressions in your search pattern.

EXERCISE TIME!

1. Extract the first 10 concordances for the phrase natural habitat.

Answer

kwic_naturalhabitat <- kwic(x = tokens(origin), pattern = phrase("natural habitat"))
# inspect
kwic_naturalhabitat %>%
  as.data.frame() %>%
  head(10)
## [1] docname from    to      pre     keyword post    pattern
## <0 rows> (or 0-length row.names)

2. How many instances are there of the phrase natural habitat?

Answer

kwic_naturalhabitat %>%
  as.data.frame() %>%
  nrow()
## [1] 0

3. Extract concordances for the phrase the origin and show the first 5 concordance lines.

Answer

kwic_theorigin <- kwic(x = tokens(origin), pattern = phrase("the origin"))
# inspect
kwic_theorigin %>%
  as.data.frame() %>%
  head(5)
##   docname from   to                           pre    keyword
## 1   text1   36   37   the definitive edition . On the Origin
## 2   text1  350  351 INDEX DETEAILED CONTENTS . ON THE ORIGIN
## 3   text1 1617 1618      . Concluding remarks . ON THE ORIGIN
## 4   text1 1679 1680        to throw some light on the origin
## 5   text1 1910 1911    conclusions that I have on the origin
##                          post    pattern
## 1      of Species BY MEANS OF the origin
## 2 OF SPECIES . INTRODUCTION . the origin
## 3 OF SPECIES . INTRODUCTION . the origin
## 4   of species - that mystery the origin
## 5    of species . Last year   the origin

## Searches using regular expressions

Regular expressions allow you to search for abstract patterns rather than concrete words or phrases, which provides you with extreme flexibility in what you can retrieve. A regular expression (in short also called regex or regexp) is a special sequence of characters that describes a pattern.
You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids. For example, the sequence [a-z]{1,3} is a regular expression that stands for one up to three lower case characters and if you searched for this regular expression, you would get, for instance, is, a, an, of, the, my, our, etc, and many other short words as results.

There are three basic types of regular expressions:

• regular expressions that stand for individual symbols and determine frequencies

• regular expressions that stand for classes of symbols

• regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.

| RegEx Symbol/Sequence | Explanation | Example |
|---|---|---|
| `?` | The preceding item is optional and will be matched at most once | `walk[a-z]?` = walk, walks |
| `*` | The preceding item will be matched zero or more times | `walk[a-z]*` = walk, walks, walked, walking |
| `+` | The preceding item will be matched one or more times | `walk[a-z]+` = walks, walked, walking |
| `{n}` | The preceding item is matched exactly n times | `walk[a-z]{2}` = walked |
| `{n,}` | The preceding item is matched n or more times | `walk[a-z]{2,}` = walked, walking |
| `{n,m}` | The preceding item is matched at least n times, but not more than m times | `walk[a-z]{2,3}` = walked, walking |

The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.

| RegEx Symbol/Sequence | Explanation |
|---|---|
| `[ab]` | lower case a and b |
| `[AB]` | upper case a and b |
| `[12]` | digits 1 and 2 |
| `[:digit:]` | digits: 0 1 2 3 4 5 6 7 8 9 |
| `[:lower:]` | lower case characters: a–z |
| `[:upper:]` | upper case characters: A–Z |
| `[:alpha:]` | alphabetic characters: a–z and A–Z |
| `[:alnum:]` | digits and alphabetic characters |
| `[:punct:]` | punctuation characters: . , ; etc. |
| `[:graph:]` | graphical characters: [:alnum:] and [:punct:] |
| `[:blank:]` | blank characters: Space and tab |
| `[:space:]` | space characters: Space, tab, newline, and other space characters |
| `[:print:]` | printable characters: [:alnum:], [:punct:] and [:space:] |

The regular expressions that denote classes of symbols are enclosed in [] and :. The last type of regular expressions, i.e. regular expressions that stand for structural properties, are shown below.

| RegEx Symbol/Sequence | Explanation |
|---|---|
| `\\w` | Word characters: `[[:alnum:]_]` |
| `\\W` | Non-word characters: `[^[:alnum:]_]` |
| `\\s` | Space characters: `[[:space:]]` |
| `\\S` | Non-space characters: `[^[:space:]]` |
| `\\d` | Digits: `[[:digit:]]` |
| `\\D` | Non-digits: `[^[:digit:]]` |
| `\\b` | Word boundary |
| `\\B` | No word boundary |
| `\\<` | Word beginning |
| `\\>` | Word end |
| `^` | Beginning of a string |
| `$` | End of a string |
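
The quantifiers and character classes listed in the tables can be tried out directly with base R. In the sketch below, the vector forms and the helper full_match are assumptions made for illustration; grepl is a base R function.

```r
# word forms used in the quantifier table
forms <- c("walk", "walks", "walked", "walking")

# illustrative helper: return the forms that a pattern matches in full
full_match <- function(pattern, x) {
  # ^ and $ anchor the pattern to the beginning and end of each string
  x[grepl(paste0("^(", pattern, ")$"), x)]
}

full_match("walk[a-z]?", forms)     # at most one additional letter
full_match("walk[a-z]*", forms)     # zero or more additional letters
full_match("walk[a-z]+", forms)     # one or more additional letters
full_match("walk[a-z]{2}", forms)   # exactly two additional letters
full_match("walk[a-z]{2,3}", forms) # two or three additional letters

# character classes: which strings consist of digits only?
only_digits <- grepl("^[[:digit:]]+$", c("1859", "Origin"))
only_digits
```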

To include regular expressions in your KWIC searches, you include them in your search pattern and set the argument valuetype to "regex". The search pattern "\\bnatu.*|\\bselec.*" retrieves elements that contain natu or selec followed by any characters and where the n in natu and the s in selec are at a word boundary, i.e. where they are the first letters of a word. Hence, our search would not retrieve words like unnatural or deselect. The | is an operator (like +, -, or *) that stands for or.

# define search patterns
patterns <- c("\\bnatu.*|\\bselec.*")
kwic_regex <- kwic(
# define and tokenize text
quanteda::tokens(origin),
# define search pattern
patterns,
# define valuetype
valuetype = "regex")
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text1 | 43 | 43 | of Species BY MEANS OF | NATURAL | SELECTION , OR THE PRESERVATION | `\bnatu.*\|\bselec.*` |
| text1 | 44 | 44 | Species BY MEANS OF NATURAL | SELECTION | , OR THE PRESERVATION OF | `\bnatu.*\|\bselec.*` |
| text1 | 264 | 264 | . 2 . VARIATION UNDER | NATURE | . 3 . STRUGGLE FOR | `\bnatu.*\|\bselec.*` |
| text1 | 274 | 274 | FOR EXISTENCE . 4 . | NATURAL | SELECTION . 5 . LAWS | `\bnatu.*\|\bselec.*` |
| text1 | 275 | 275 | EXISTENCE . 4 . NATURAL | SELECTION | . 5 . LAWS OF | `\bnatu.*\|\bselec.*` |
| text1 | 411 | 411 | and Origin . Principle of | Selection | anciently followed , its Effects | `\bnatu.*\|\bselec.*` |
| text1 | 421 | 421 | Effects . Methodical and Unconscious | Selection | . Unknown Origin of our | `\bnatu.*\|\bselec.*` |
| text1 | 436 | 436 | favourable to Man's power of | Selection | . CHAPTER 2 . VARIATION | `\bnatu.*\|\bselec.*` |
| text1 | 443 | 443 | CHAPTER 2 . VARIATION UNDER | NATURE | . Variability . Individual Differences | `\bnatu.*\|\bselec.*` |
| text1 | 521 | 521 | FOR EXISTENCE . Bears on | natural | selection . The term used | `\bnatu.*\|\bselec.*` |
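
The effect of the word boundary and of the | operator in this pattern can be checked on a handful of words with base R. The vector candidates is an assumption for illustration; ignore.case = TRUE is used here to mimic the case-insensitive matching that kwic applies by default.

```r
# a few test words (illustrative)
candidates <- c("natural", "unnatural", "selection", "deselect", "NATURE")
# the search pattern from the text
pattern <- "\\bnatu.*|\\bselec.*"
# TRUE where natu or selec occurs at the beginning of a word
is_hit <- grepl(pattern, candidates, ignore.case = TRUE, perl = TRUE)
# show each word together with the result
setNames(is_hit, candidates)
```

As described above, unnatural and deselect are not retrieved because natu and selec do not occur at a word boundary in these words.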

EXERCISE TIME!



1. Extract the first 10 concordances for words containing exu.

kwic_exu <- kwic(x = tokens(origin), pattern = ".*exu.*", valuetype = "regex")
# inspect
kwic_exu %>%
  as.data.frame() %>%
  head(10)
  ##    docname  from    to                               pre keyword
## 1    text1   646   646               and on both sexes .  Sexual
## 2    text1   806   806 variable than generic : secondary  sexual
## 3    text1 29294 29294               and on both sexes .  Sexual
## 4    text1 31953 31953      like every other structure . _Sexual
## 5    text1 32040 32040              words on what I call  Sexual
## 6    text1 32082 32082             few or no offspring .  Sexual
## 7    text1 32157 32157     chance of leaving offspring .  Sexual
## 8    text1 32330 32330         be given through means of  sexual
## 9    text1 32628 32628   having been chiefly modified by  sexual
## 10   text1 32726 32726        have been mainly caused by  sexual
##                                post pattern
## 1     Selection . On the generality .*exu.*
## 2  characters variable . Species of .*exu.*
## 3     Selection . On the generality .*exu.*
## 4        Selection_ . - Inasmuch as .*exu.*
## 5        Selection . This depends , .*exu.*
## 6        selection is , therefore , .*exu.*
## 7  selection by always allowing the .*exu.*
## 8           selection , as the mane .*exu.*
## 9       selection , acting when the .*exu.*
## 10            selection ; that is , .*exu.*

2. How many instances are there of words beginning with nonet?

kwic_nonet <- kwic(x = tokens(origin), pattern = "\\bnonet.*", valuetype = "regex") %>%
  as.data.frame() %>%
  nrow()

3. Extract concordances for words ending with ption and show the first 5 concordance lines.

kwic_ption <- kwic(x = tokens(origin), pattern = "ption\\b", valuetype = "regex")
# inspect
kwic_ption %>%
  as.data.frame() %>%
  head(5)
  ##   docname from   to                          pre    keyword
## 1   text1 1605 1605    extended . Effects of its   adoption
## 2   text1 2641 2641          see them ; but this assumption
## 3   text1 3926 3926         or at the instant of conception
## 4   text1 3990 3990          prior to the act of conception
## 5   text1 4233 4233 under confinement , with the  exception
##                          post  pattern
## 1     on the study of Natural ption\\b
## 2           seems to me to be ption\\b
## 3   . Geoffroy St . Hilaire's ption\\b
## 4   . Several reasons make me ption\\b
## 5 of the plantigrades or bear ption\\b

## Piping concordances

Quite often, we only want to retrieve patterns if they occur in a certain context. For instance, we might be interested in instances of selection but only if the preceding word is natural. Such conditional concordances could be extracted using regular expressions but they are easier to retrieve by piping. Piping is done using the %>% operator, which comes from the magrittr package and is made available by dplyr; the piping sequence can be read as and then. We can then filter those concordances that contain natural using the filter function from the dplyr package. Note that the $ stands for the end of a string so that natural$ means that natural is the last element in the string that precedes the keyword.

kwic_pipe <- kwic(x = tokens(origin), pattern = "selection") %>%
  # convert to data frame
  as.data.frame() %>%
  # keep concordances where the preceding word is natural
  dplyr::filter(stringr::str_detect(pre, "natural$|NATURAL$"))

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text1 | 44 | 44 | Species BY MEANS OF NATURAL | SELECTION | , OR THE PRESERVATION OF | selection |
| text1 | 275 | 275 | EXISTENCE . 4 . NATURAL | SELECTION | . 5 . LAWS OF | selection |
| text1 | 522 | 522 | EXISTENCE . Bears on natural | selection | . The term used in | selection |
| text1 | 616 | 616 | . CHAPTER 4 . NATURAL | SELECTION | . Natural Selection : its | selection |
| text1 | 754 | 754 | disuse , combined with natural | selection | ; organs of flight and | selection |
| text1 | 1,597 | 1,597 | far the theory of natural | selection | may be extended . Effects | selection |
| text1 | 6,617 | 6,617 | do occur ; but natural | selection | , as will hereafter be | selection |
| text1 | 14,827 | 14,827 | a process of " natural | selection | , " as will hereafter | selection |
| text1 | 17,269 | 17,269 | they afford materials for natural | selection | to accumulate , in the | selection |
| text1 | 17,819 | 17,819 | and rendered definite by natural | selection | , as hereafter will be | selection |

The pipe is a very useful helper and it is very frequently used in R - not only in the context of text processing but in all data science related domains.
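
The filtering step itself does not depend on the pipe. The following sketch applies the same idea in base R to a small, hand-made data frame that mimics the pre, keyword, and post columns of a concordance; the data frame toy_kwic and its content are assumptions for illustration.

```r
# hand-made stand-in for a concordance table (illustrative data)
toy_kwic <- data.frame(
  pre     = c("BY MEANS OF NATURAL", "Principle of", "Bears on natural",
              "seems quite unnatural"),
  keyword = c("SELECTION", "Selection", "selection", "selection"),
  post    = c(", OR THE", "anciently followed", ". The term", "to some"),
  stringsAsFactors = FALSE
)
# \\b makes sure that natural is a whole word, $ that it ends the string
keep <- grepl("\\bnatural$", toy_kwic$pre, ignore.case = TRUE, perl = TRUE)
# retain only rows in which natural immediately precedes the keyword
toy_filtered <- toy_kwic[keep, ]
toy_filtered
```

Using ignore.case = TRUE covers natural, Natural, and NATURAL in one pattern, and the word boundary excludes forms such as unnatural.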

## Arranging concordances and adding frequency information

When inspecting concordances, it is useful to re-order the concordances so that they do not appear in the order in which they occur in the text or texts but are ordered by their context. To reorder concordances, we can use the arrange function from the dplyr package which takes the column according to which we want to re-arrange the data as its main argument.

In the example below, we extract all instances of natural and then arrange the instances according to the content of the post column in alphabetical order.

kwic_ordered <- kwic(x = tokens(origin), pattern = "natural") %>%
  # convert to data frame
  as.data.frame() %>%
  # order by the following context
  dplyr::arrange(post)

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text1 | 176,207 | 176,207 | , 190 . System , | natural | , 413 . Tail : | natural |
| text1 | 176,668 | 176,668 | , 159 . Varieties : | natural | , 44 . struggle between | natural |
| text1 | 175,731 | 175,731 | . unconscious , 34 . | natural | , 80 . sexual , | natural |
| text1 | 147,280 | 147,280 | and this would be strictly | natural | , as it would connect | natural |
| text1 | 175,739 | 175,739 | . sexual , 87 . | natural | , circumstances favourable to , | natural |
| text1 | 146,387 | 146,387 | genealogical in order to be | natural | ; but that the \_amount\_ | natural |
| text1 | 111,947 | 111,947 | of old forms , both | natural | and artificial , are bound | natural |
| text1 | 56,947 | 56,947 | parts having been accumulated by | natural | and sexual selection , and | natural |
| text1 | 56,630 | 56,630 | be taken advantage of by | natural | and sexual selection , in | natural |
| text1 | 150,464 | 150,464 | , or at least a | natural | arrangement , would be possible | natural |

Arranging concordances alphabetically may, however, not be the most useful option. A more useful option may be to arrange concordances according to the frequency of co-occurring terms or collocates. In order to do this, we need to extract the co-occurring words and calculate their frequency. We can do this by combining the mutate, group_by, and n functions from the dplyr package with the str_remove_all function from the stringr package. Then, we arrange the concordances by the frequency of the collocates in descending order (that is why we put a - in the arrange function). In order to do this, we need to

1. create a new variable or column which represents the word that co-occurs with, or, as in the example below, immediately follows the search term. In the example below, we use the mutate function to create a new column called post_word. We then use the str_remove_all function to remove everything except for the word that immediately follows the search term (we simply remove everything from the first white space onwards).

2. group the data by the word that immediately follows the search term.

3. create a new column called post_word_freq which represents the frequencies of all the words that immediately follow the search term.

4. arrange the concordances by the frequency of the collocates in descending order.

kwic_ordered_coll <- kwic(
  # define and tokenize text
  x = quanteda::tokens(origin),
  # define search pattern
  pattern = "natural") %>%
  # convert to data frame
  as.data.frame() %>%
  # extract word following the keyword
  dplyr::mutate(post_word = stringr::str_remove_all(post, " .*")) %>%
  # group following words
  dplyr::group_by(post_word) %>%
  # extract frequencies of the following words
  dplyr::mutate(post_word_freq = dplyr::n()) %>%
  # arrange/order by the frequency of the following word
  dplyr::arrange(-post_word_freq)

| docname | from | to | pre | keyword | post | pattern | post_word | post_word_freq |
|---|---|---|---|---|---|---|---|---|
| text1 | 618 | 618 | 4 . NATURAL SELECTION . | Natural | Selection : its power compared | natural | Selection | 4 |
| text1 | 666 | 666 | Circumstances favourable and unfavourable to | Natural | Selection , namely , intercrossing | natural | Selection | 4 |
| text1 | 685 | 685 | action . Extinction caused by | Natural | Selection . Divergence of Character | natural | Selection | 4 |
| text1 | 709 | 709 | to naturalisation . Action of | Natural | Selection , through Divergence of | natural | Selection | 4 |
| text1 | 43 | 43 | of Species BY MEANS OF | NATURAL | SELECTION , OR THE PRESERVATION | natural | SELECTION | 3 |
| text1 | 274 | 274 | FOR EXISTENCE . 4 . | NATURAL | SELECTION . 5 . LAWS | natural | SELECTION | 3 |
| text1 | 615 | 615 | relations . CHAPTER 4 . | NATURAL | SELECTION . Natural Selection : | natural | SELECTION | 3 |
| text1 | 521 | 521 | FOR EXISTENCE . Bears on | natural | selection . The term used | natural | selection | 1 |

Following the same schema, we could add more columns and arrange the concordance according to them. For example, we could add another column that represents the frequency of the words that immediately precede the search term and then arrange the concordance according to this column.
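As a minimal sketch of ordering by the preceding word, the code below applies the same schema to a small, made-up concordance table. The object `toy_kwic` and the columns `pre_word` and `pre_word_freq` are illustrative names that do not appear elsewhere in this tutorial; with real data, the output of the kwic function would take the place of the toy table.

```r
library(dplyr)
library(stringr)

# toy concordance table (made-up rows, not taken from the Darwin text)
toy_kwic <- data.frame(
  pre = c("the theory of", "a process of", "combined with", "the action of"),
  keyword = c("natural", "natural", "natural", "natural"),
  post = c("selection may be", "selection as will", "selection organs of", "history and the"),
  stringsAsFactors = FALSE
)

toy_ordered <- toy_kwic %>%
  # keep only the last word of the preceding context
  dplyr::mutate(pre_word = stringr::str_remove_all(pre, ".* ")) %>%
  # count how often each preceding word occurs
  dplyr::group_by(pre_word) %>%
  dplyr::mutate(pre_word_freq = dplyr::n()) %>%
  dplyr::ungroup() %>%
  # show the most frequent preceding words first
  dplyr::arrange(dplyr::desc(pre_word_freq))
# inspect
toy_ordered
```

The only change compared to the post_word example is the regular expression: `".* "` removes everything up to and including the last white space, so that only the word directly before the keyword remains.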

## Ordering by subsequent elements

In this section, we will extract the three words following the keyword (selection) and organize the concordances by the frequencies of the following words. We begin by inspecting the first 6 lines of the concordance of selection.

head(kwic_natural)
## Keyword-in-context with 6 matches.
##   [text1, 44]         Species BY MEANS OF NATURAL | SELECTION |
##  [text1, 275]               EXISTENCE. 4. NATURAL | SELECTION |
##  [text1, 411]            and Origin. Principle of | Selection |
##  [text1, 421] Effects. Methodical and Unconscious | Selection |
##  [text1, 436]        favourable to Man's power of | Selection |
##  [text1, 522]         EXISTENCE. Bears on natural | selection |
##
##  , OR THE PRESERVATION OF
##  . 5. LAWS OF
##  anciently followed, its Effects
##  . Unknown Origin of our
##  . CHAPTER 2. VARIATION
##  . The term used in

Next, we take the concordances and create a clean post column that is all in lower case and that does not contain any punctuation.

kwic_natural %>%
# convert to data frame
as.data.frame() %>%
# create new CleanPost
dplyr::mutate(CleanPost = stringr::str_remove_all(post, "[:punct:]"),
CleanPost = stringr::str_squish(CleanPost),
CleanPost = tolower(CleanPost))-> kwic_natural_following
# inspect
head(kwic_natural_following)
##   docname from  to                                  pre   keyword
## 1   text1   44  44          Species BY MEANS OF NATURAL SELECTION
## 2   text1  275 275              EXISTENCE . 4 . NATURAL SELECTION
## 3   text1  411 411            and Origin . Principle of Selection
## 4   text1  421 421 Effects . Methodical and Unconscious Selection
## 5   text1  436 436         favourable to Man's power of Selection
## 6   text1  522 522         EXISTENCE . Bears on natural selection
##                               post   pattern                      CleanPost
## 1         , OR THE PRESERVATION OF selection         or the preservation of
## 2                    . 5 . LAWS OF selection                      5 laws of
## 3 anciently followed , its Effects selection anciently followed its effects
## 4          . Unknown Origin of our selection          unknown origin of our
## 5          . CHAPTER 2 . VARIATION selection            chapter 2 variation
## 6               . The term used in selection               the term used in

In a next step, we extract the 1st, 2nd, and 3rd words following the keyword.

kwic_natural_following %>%
# extract first element after keyword
dplyr::mutate(FirstWord = stringr::str_remove_all(CleanPost, " .*")) %>%
# extract second element after keyword
dplyr::mutate(SecWord = stringr::str_remove(CleanPost, ".*? "),
SecWord = stringr::str_remove_all(SecWord, " .*")) %>%
# extract third element after keyword
dplyr::mutate(ThirdWord = stringr::str_remove(CleanPost, ".*? "),
ThirdWord = stringr::str_remove(ThirdWord, ".*? "),
ThirdWord = stringr::str_remove_all(ThirdWord, " .*")) -> kwic_natural_following
# inspect
head(kwic_natural_following)
##   docname from  to                                  pre   keyword
## 1   text1   44  44          Species BY MEANS OF NATURAL SELECTION
## 2   text1  275 275              EXISTENCE . 4 . NATURAL SELECTION
## 3   text1  411 411            and Origin . Principle of Selection
## 4   text1  421 421 Effects . Methodical and Unconscious Selection
## 5   text1  436 436         favourable to Man's power of Selection
## 6   text1  522 522         EXISTENCE . Bears on natural selection
##                               post   pattern                      CleanPost
## 1         , OR THE PRESERVATION OF selection         or the preservation of
## 2                    . 5 . LAWS OF selection                      5 laws of
## 3 anciently followed , its Effects selection anciently followed its effects
## 4          . Unknown Origin of our selection          unknown origin of our
## 5          . CHAPTER 2 . VARIATION selection            chapter 2 variation
## 6               . The term used in selection               the term used in
##   FirstWord  SecWord    ThirdWord
## 1        or      the preservation
## 2         5     laws           of
## 3 anciently followed          its
## 4   unknown   origin           of
## 5   chapter        2    variation
## 6       the     term         used
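The repeated calls to str_remove above can be hard to read. As an alternative sketch (not the approach used in this tutorial), the `word` function from the stringr package extracts the n-th white-space-separated word directly. The vector `clean_post` below is a hypothetical stand-in for the CleanPost column and copies its first three values.

```r
library(stringr)

# stand-in for the CleanPost column (first three values from the table above)
clean_post <- c("or the preservation of",
                "5 laws of",
                "anciently followed its effects")

# extract the 1st, 2nd, and 3rd word after the keyword
first_word <- stringr::word(clean_post, 1)
sec_word   <- stringr::word(clean_post, 2)
third_word <- stringr::word(clean_post, 3)

# combine into a table for inspection
data.frame(first_word, sec_word, third_word)
```

One design difference worth keeping in mind: when a context contains fewer words than requested, `word` should return NA, whereas the str_remove approach returns whatever string is left over.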

Next, we calculate the frequencies of the subsequent words and arrange the concordances in descending order of frequency, from the 1st to the 3rd word following the keyword.

kwic_natural_following %>%
# calculate frequency of following words
# 1st word
dplyr::group_by(FirstWord) %>%
dplyr::mutate(FreqW1 = n()) %>%
# 2nd word
dplyr::group_by(SecWord) %>%
dplyr::mutate(FreqW2 = n()) %>%
# 3rd word
dplyr::group_by(ThirdWord) %>%
dplyr::mutate(FreqW3 = n()) %>%
# ungroup
dplyr::ungroup() %>%
# arrange by following words
dplyr::arrange(-FreqW1, -FreqW2, -FreqW3) -> kwic_natural_following
# inspect results
head(kwic_natural_following, 10)
## # A tibble: 10 × 14
##    docname  from    to pre     keyword post  pattern CleanPost FirstWord SecWord
##    <chr>   <int> <int> <chr>   <chr>   <chr> <fct>   <chr>     <chr>     <chr>
##  1 text1    3064  3064 This f… Select… will… select… will be … will      be
##  2 text1   31421 31421 state … select… will… select… will be … will      be
##  3 text1   31988 31988 and if… select… will… select… will be … will      be
##  4 text1   60694 60694 slow p… select… will… select… will in … will      in
##  5 text1   15600 15600 called… select… will… select… will alw… will      always
##  6 text1   37304 37304 as mig… select… will… select… will alw… will      always
##  7 text1   72213 72213 becaus… select… will… select… will alw… will      always
##  8 text1   39275 39275 new sp… select… will… select… will alw… will      always
##  9 text1   39449 39449 I do b… select… will… select… will alw… will      always
## 10 text1   43007 43007 modifi… select… will… select… will alw… will      always
## # … with 4 more variables: ThirdWord <chr>, FreqW1 <int>, FreqW2 <int>,
## #   FreqW3 <int>

The results now show the concordance arranged by the frequency of the words following the keyword.
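The combination of group_by and mutate with n() can also be written more compactly. The sketch below uses the `add_count` function from dplyr on a small, made-up table; `toy_following` is an illustrative object and its values are invented rather than taken from the concordance above.

```r
library(dplyr)

# toy table with the first and second word after the keyword (made-up values)
toy_following <- data.frame(
  FirstWord = c("will", "will", "is", "will"),
  SecWord   = c("be", "always", "the", "be"),
  stringsAsFactors = FALSE
)

toy_sorted <- toy_following %>%
  # add the frequency of each first word as a new column
  dplyr::add_count(FirstWord, name = "FreqW1") %>%
  # add the frequency of each second word as a new column
  dplyr::add_count(SecWord, name = "FreqW2") %>%
  # arrange by the frequency of the first, then the second word
  dplyr::arrange(dplyr::desc(FreqW1), dplyr::desc(FreqW2))
# inspect
toy_sorted
```

Because add_count does not leave the data grouped, no separate ungroup step is needed before arranging.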

## Concordances from transcriptions

As many analyses use transcripts as their primary data and because transcripts have features that require additional processing, we will now perform concordancing based on transcripts. As a first step, we load five example transcripts that represent the first five files from the Irish component of the International Corpus of English.

# define corpus files
files <- paste("https://slcladal.github.io/data/ICEIrelandSample/S1A-00", 1:5, ".txt", sep = "")
# load corpus files
transcripts <- sapply(files, function(x){
# read-in text line by line
x <- readLines(x)
})
 . <#> Well how did the riding go tonight <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter <#> What did you call your horse <#> I can't remember <#> Oh Mary 's Town <,> oh <#> And how did Mabel do <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse refused and it refused three times <#> And then <,> she got it round and she just lined it up straight and she just kicked it and she hit it with the whip <,> and over it went the last time you know <#> And Stephanie told her she was very determined and very well-ridden <&> laughter because it had refused the other times you know <#> But Stephanie wouldn't let her give up on it <#> She made her keep coming back and keep coming back <,> until <,> it jumped it you know <#> It was good <#> Yeah I 'm not so sure her jumping 's improving that much <#> She uh <,> seemed to be holding the reins very tight

The first ten lines shown above let us know that, after the header (<S1A-001 Riding>) and the symbol which indicates the start of the transcript (<I>), each utterance is preceded by a sequence which indicates the section, file, and speaker (e.g. <S1A-001$A>). The first utterance is thus uttered by speaker A in file 001 of section S1A. In addition, there are several sequences that provide meta-linguistic information which indicate the beginning of a speech unit (<#>), pauses (<,>), and laughter (<&> laughter </&>).

To perform the concordancing, we need to change the format of the transcripts because the kwic function only works on character, corpus, or tokens objects. In their present form, the transcripts represent a list which contains vectors of strings. To change the format, we collapse the individual utterances into a single character vector for each transcript.

transcripts_collapsed <- sapply(files, function(x){
# read-in text
x <- readLines(x)
# paste all lines together
x <- paste0(x, collapse = " ")
# remove superfluous white spaces
x <- str_squish(x)
})
<#> Well how did the riding go tonight <#> It was good so it was <#> Just I I couldn't believe that she was going to let me jump <,> that was only the fourth time you know <#> It was great <&> laughter <#> What did you call your horse <#> I can't remember <#> Oh Mary 's Town <,> oh <#> And how did Mabel do <#> Did you not see her whenever she was going over the jumps <#> There was one time her horse <#> He 's been married for three years and is now <{> <[> getting divorced <#> <[> No no he 's got married last year and he 's getting <{> <[> divorced <#> <[> He 's now getting divorced <#> Just right <#> A wee girl of her age like <#> Well there was a guy <#> How long did she try it for <#> An hour a a year <#> Mhm <{> <[> mhm <#> I <.> wa I want to go to Peru but uh <#> Do you <#> Oh aye <#> I 'd love to go to Peru <#> I want I want to go up the Machu Picchu before it falls off the edge of the mountain <#> Lima 's supposed to be a bit dodgy <#> Mm <#> Bet it would be <#> Mm <#> But I I just I I would like <,> Machu Picchu is collapsing <#> I don't know wh <#> Honest to God <,> I think the young ones <#> Sure they 're flying on Monday in I think it 's Shannon <#> This is from Texas <#> This English girl <#> The youngest one <,> the dentist <,> she 's married to the dentist <#> Herself and her husband <,> three children and she 's six months pregnant <#> Oh God <#> And where are they going <#> Coming to Dublin to the mother <{> <[> or 3 sy <#> Right shall we risk another beer or shall we try and <,> <{> <[> ride the bikes down there or do something like that <#> <[> Well <,> what about the provisions <#> What time <{> <[> 4 sylls <#> <[> Is is your man coming here <#> <{> <[> Yeah <#> <[> He said he would meet us here <#> Just the boat 's arriving you know a few minutes ' wa

We can now extract the concordances.
kwic_trans <- quanteda::kwic(
# tokenize transcripts
quanteda::tokens(transcripts_collapsed),
# define search pattern
pattern = phrase("you know"))

| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 62 | 63 | was only the fourth time | you know | < # > It was | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 204 | 205 | it went the last time | you know | < # > And Stephanie | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 235 | 236 | had refused the other times | you know | < # > But Stephanie | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 272 | 273 | , > it jumped it | you know | < # > It was | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 602 | 603 | that one < , > | you know | and starting anew fresh < | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 665 | 666 | { > < [ > | you know | < / [ > < | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 736 | 737 | > We didn't discuss it | you know | < S1A-001$ A > | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 922 | 923 | on Tuesday < , > | you know | < # > But I | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 1,126 | 1,127 | that she could take her | you know | the wee shoulder bag she | you know |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | 1,257 | 1,258 | around < , > uhm | you know | their timetable and < , | you know |

The results show that each non-alphanumeric character is counted as a single word, which reduces the context of the keyword substantially. Also, the docname column contains the full path to the data, which makes it hard to parse the content of the table. To address the first issue, we specify a tokenizer that splits the text at white spaces only and thus leaves the annotation largely intact. In addition, we clean the docname column and extract only the file name. Lastly, we expand the context window to 10 so that we have a better understanding of the context in which the phrase was used.
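To see why the choice of tokenizer matters, the following sketch tokenizes a single made-up utterance twice: once with the default tokenizer and once with the "fasterword" tokenizer. The object names `toy_utterance`, `toks_default`, and `toks_fast` are illustrative, and the exact tokens produced by the default tokenizer may differ between quanteda versions.

```r
library(quanteda)

# made-up utterance in the style of the ICE Ireland transcripts
toy_utterance <- "that was only the fourth time you know <#> It was great"

# default tokenizer: symbols inside the annotation are typically split off
toks_default <- quanteda::tokens(toy_utterance)
# whitespace-based tokenizer: annotation such as <#> stays in one piece
toks_fast <- quanteda::tokens(toy_utterance, what = "fasterword")

# compare the number of tokens produced by each tokenizer
quanteda::ntoken(toks_default)
quanteda::ntoken(toks_fast)
# inspect the whitespace-based tokens
as.character(toks_fast)
```

With the whitespace-based tokenizer, each annotation symbol takes up only one slot of the context window, so more of the actual utterance fits into the pre and post columns.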

kwic_trans <- quanteda::kwic(
# tokenize transcripts
quanteda::tokens(transcripts_collapsed, what = "fasterword"),
# define search
pattern = phrase("you know"),
# extend context
window = 10) %>%
# clean docnames
dplyr::mutate(docname = str_replace_all(docname, ".*/([A-Z][0-9][A-Z]-[0-9]{1,3}).txt", "\\1"))
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| S1A-001 | 42 | 43 | let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter <#> What | you know |
| S1A-001 | 140 | 141 | the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determined and | you know |
| S1A-001 | 164 | 165 | <&> laughter because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it | you know |
| S1A-001 | 193 | 194 | and keep coming back <,> until <,> it jumped it | you know | <#> It was good <#> Yeah I 'm not | you know |
| S1A-001 | 402 | 403 | 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <#> Yeah but I mean | you know |
| S1A-001 | 443 | 444 | the best goes top of the league <,> <{> <[> | you know | <#> <[> So it 's like | you know |
| S1A-001 | 484 | 485 | I 'm not sure now <#> We didn't discuss it | you know | <#> Well it sounds like more money <#> | you know |

Extending the context can also be used to identify the speaker who uttered the search pattern that we are interested in. We will do just that, as this is a common task in linguistic analyses.

To extract speakers, we need to follow these steps:

1. Create normal concordances of the pattern that we are interested in.

2. Generate concordances of the pattern that we are interested in with a substantially enlarged context window size.

3. Extract the speakers from the enlarged context window size.

4. Add the speakers to the normal concordances as a new column using the mutate function from the dplyr package.

kwic_normal <- quanteda::kwic(
# tokenize transcripts
quanteda::tokens(transcripts_collapsed, what = "fasterword"),
# define search
pattern = phrase("you know")) %>%
as.data.frame()
kwic_speaker <- quanteda::kwic(
# tokenize transcripts
quanteda::tokens(transcripts_collapsed, what = "fasterword"),
# define search
pattern = phrase("you know"),
# extend search window
window = 500) %>%
# convert to data frame
as.data.frame() %>%
# extract speaker (comes after $ and before >)
dplyr::mutate(speaker = stringr::str_replace_all(pre, ".*\\$(.*?)>.*", "\\1")) %>%
# pull out the speaker column as a vector
dplyr::pull(speaker)
# add speaker to normal kwic
kwic_combined <- kwic_normal %>%
dplyr::mutate(speaker = kwic_speaker) %>%
# simplify docname
dplyr::mutate(docname = stringr::str_replace_all(docname, ".*/([A-Z][0-9][A-Z]-[0-9]{1,3}).txt", "\\1")) %>%
# remove superfluous columns
dplyr::select(-to, -from, -pattern)
| docname | pre | keyword | post | speaker |
|---|---|---|---|---|
| S1A-001 | was only the fourth time | you know | <#> It was great <&> | B |
| S1A-001 | it went the last time | you know | <#> And Stephanie told her | B |
| S1A-001 | had refused the other times | you know | <#> But Stephanie wouldn't let | B |
| S1A-001 | until <,> it jumped it | you know | <#> It was good | B |
| S1A-001 | <,> for that one <,> | you know | and starting anew fresh | B |
| S1A-001 | the league <,> <{> <[> | you know | <#> <[> So | B |
| S1A-001 | <#> We didn't discuss it | you know | <#> Well it sounds | B |
| S1A-001 | her lesson on Tuesday <,> | you know | <#> But I was keeping | B |
| S1A-001 | that she could take her | you know | the wee shoulder bag she | B |
| S1A-001 | show them around <,> uhm | you know | their timetable and <,> give | B |

The resulting table shows that we have successfully extracted the speakers (identified by the letters in the speaker column) and cleaned the file names (in the docname column).
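The regular expression that extracts the speaker is easier to follow on a small example. In the sketch below, `pre_context` is a made-up string that imitates an enlarged preceding context with two speaker tags. The greedy `.*` at the start skips ahead to the last `$` in the string, so the speaker tag closest to the keyword is the one that is captured.

```r
library(stringr)

# made-up preceding context containing two speaker tags (A, then B)
pre_context <- paste("<S1A-001$A> <#> Well how did the riding go tonight",
                     "<S1A-001$B> <#> It was good so it was <#> It was great")

# keep only what stands between the last $ and the following >
speaker <- stringr::str_replace_all(pre_context, ".*\\$(.*?)>.*", "\\1")
# inspect
speaker
```

If the preceding context contains no speaker tag at all, the pattern does not match and the string is returned unchanged, which is why the context window has to be large enough to reach back to the most recent tag.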

## Customizing concordances

As R represents a fully-fledged programming environment, we can, of course, also write our own, customized concordance function. The code below shows how you could go about doing so. Note, however, that this function only works if you enter more than a single file.

mykwic <- function(txts, pattern, context) {
# activate packages
require(stringr)
require(plyr)
# list files
conc <- sapply(txts, function(x) {
# determine length of text
lngth <- as.vector(unlist(nchar(x)))
# determine position of hits
idx <- str_locate_all(x, pattern)
idx <- idx[[1]]
# stop early if the pattern does not occur in this text
if (nrow(idx) < 1) return("No hits found")
# define start position of hit
token.start <- idx[,1]
# define end position of hit
token.end <- idx[,2]
# define start position of preceding context
pre.start <- ifelse(token.start-context < 1, 1, token.start-context)
# define end position of preceding context
pre.end <- token.start-1
# define start position of subsequent context
post.start <- token.end+1
# define end position of subsequent context
post.end <- ifelse(token.end+context > lngth, lngth, token.end+context)
# extract the texts defined by the positions
PreceedingContext <- substring(x, pre.start, pre.end)
Token <- substring(x, token.start, token.end)
SubsequentContext <- substring(x, post.start, post.end)
conc <- cbind(PreceedingContext, Token, SubsequentContext)
# return concordance
return(conc)
})
concdf <- ldply(conc, data.frame)
colnames(concdf)[1]<- "File"
return(concdf)
}

We can now test whether this function works by searching for the sequence you know in the transcripts that we loaded earlier. One difference between the kwic function provided by the quanteda package and the customized concordance function used here is that the kwic function uses the number of words to define the context window, while the mykwic function uses the number of characters or symbols instead (which is why we use a notably higher number to define the context window).
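The character-based context window can be illustrated in isolation. The sketch below is a simplified version of the steps inside mykwic applied to one made-up sentence; `txt`, `hits`, and `context` are illustrative names, and `pmax` and `pmin` are used here in place of ifelse to keep the window inside the text.

```r
library(stringr)

# made-up sentence with two hits
txt <- "It was great you know and then she said you know what I mean"
# number of characters (not words) to show on each side
context <- 10

# start and end positions of all hits
hits <- stringr::str_locate_all(txt, "you know")[[1]]

# cut out the preceding context, the hit, and the subsequent context
pre   <- substring(txt, pmax(hits[, "start"] - context, 1), hits[, "start"] - 1)
token <- substring(txt, hits[, "start"], hits[, "end"])
post  <- substring(txt, hits[, "end"] + 1, pmin(hits[, "end"] + context, nchar(txt)))

# combine into a table for inspection
data.frame(pre, token, post)
```

Because the window is measured in characters, the context can begin or end in the middle of a word, which is also visible in the concordances produced by mykwic.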

myconcordances <- mykwic(transcripts_collapsed, "you know", 50)
| File | PreceedingContext | Token | SubsequentContext |
|---|---|---|---|
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | to let me jump <,> that was only the fourth time | you know | <#> It was great <&> laughter <# |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | with the whip <,> and over it went the last time | you know | <#> And Stephanie told her she was very determine |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | ghter because it had refused the other times | you know | <#> But Stephanie wouldn't let her give up on it |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | k and keep coming back <,> until <,> it jumped it | you know | <#> It was good <#> Yeah I 'm not so |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | she 'd be far better waiting <,> for that one <,> | you know | and starting anew fresh <#> Yeah but |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | er 's the best goes top of the league <,> <{> <[> | you know | <#> <[> So it 's like |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | ot <#> I 'm not sure now <#> We didn't discuss it | you know | <#> Well it sounds like more money |
|  |  | you know | <#> But I was keeping her going cos I says oh I w |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | e to take it tomorrow <,> that she could take her | you know | the wee shoulder bag she has <#> Mhm |
| https://slcladal.github.io/data/ICEIrelandSample/S1A-001.txt | cker <,> and <,> sort of show them around <,> uhm | you know | their timetable and <,> give them their timetable |

As this concordance function only works for more than one text, we split the text of Darwin’s On the Origin of Species into chapters and assign each section a name.

# split text into chapters
origin_split <- origin %>%
stringr::str_squish() %>%
stringr::str_split("[CHAPTER]{7,7} [XVI]{1,7}\\. ") %>%
unlist()
origin_split <- origin_split[which(nchar(origin_split) > 2000)]
names(origin_split) <- paste0("text", 1:length(origin_split))
# inspect data
nchar(origin_split)
##  text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 text11
##  17465  69701  29396  35636  94170  73401  66349  69085  61085  58518  62094
## text12 text13 text14 text15
##  67855  51300  87340  86574
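The splitting step can be tried out on a short, made-up string first. The sketch below uses the more literal pattern `"CHAPTER [XVI]+\\. "` instead of the character class `[CHAPTER]{7,7}` used above; this is an assumption about what the pattern is meant to match (the word CHAPTER followed by a Roman numeral and a full stop), and `toy_text` and `toy_split` are illustrative names.

```r
library(stringr)

# made-up text with two chapter headings
toy_text <- "Preface text. CHAPTER I. First chapter text. CHAPTER II. Second chapter text."

# split at each chapter heading and convert the resulting list into a vector
toy_split <- unlist(stringr::str_split(toy_text, "CHAPTER [XVI]+\\. "))
# name the elements so that they can be told apart later
names(toy_split) <- paste0("text", seq_along(toy_split))
# inspect
toy_split
```

The character class version matches any sequence of seven letters drawn from C, H, A, P, T, E, and R, so the literal pattern is slightly stricter. For text in which chapter headings are written as CHAPTER followed by a Roman numeral, both should split at the same positions.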

Now that we have named elements, we can search for the pattern natural selection. We also need to clean the concordance as some sections do not contain any instances of the search pattern. To clean the data, we select only the columns File, PreceedingContext, Token, and SubsequentContext and then remove all rows where information is missing.

natsel_conc <- mykwic(origin_split, "natural selection", 50) %>%
dplyr::select(File, PreceedingContext, Token, SubsequentContext) %>%
na.omit()
| File | PreceedingContext | Token | SubsequentContext |
|---|---|---|---|
| text1 | nges. CHAPTER 3. STRUGGLE FOR EXISTENCE. Bears on | natural selection | . The term used in a wide sense. Geometrical power |
| text1 | xternal conditions. Use and disuse, combined with | natural selection | ; organs of flight and of vision. Acclimatisation. |
| text1 | he immutability of species. How far the theory of | natural selection | may be extended. Effects of its adoption on the s |
| text2 | nd reversions of character probably do occur; but | natural selection | , as will hereafter be explained, will determine h |
| text2 | ntry than in the other, and thus by a process of “ | natural selection | ,” as will hereafter be more fully explained, two |
| text3 | ly important for us, as they afford materials for | natural selection | to accumulate, in the same manner as man can accu |
| text3 | have not been seized on and rendered definite by | natural selection | , as hereafter will be explained. Those forms whic |
| text3 | to one in which it differs more, to the action of | natural selection | in accumulating (as will hereafter be more fully |
| text4 | STRUGGLE FOR EXISTENCE. Bears on | natural selection | . The term used in a wide sense. Geometrical power |
| text5 | her useful nor injurious would not be affected by | natural selection | , and would be left a fluctuating element, as perh |

You can go ahead and modify the customized concordance function to suit your needs.

# Citation & Session Info

Schweinberger, Martin. 2022. Concordancing with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/kwics.html (Version 2022.05.21).

@manual{schweinberger2022kwics,
author = {Schweinberger, Martin},
title = {Concordancing with R},
year = {2022},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
edition = {2022.05.21}
}
sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base
##
## other attached packages:
##  [1] plyr_1.8.7       flextable_0.7.0  forcats_0.5.1    stringr_1.4.0
##  [5] dplyr_1.0.9      purrr_0.3.4      readr_2.1.2      tidyr_1.2.0
##  [9] tibble_3.1.7     ggplot2_3.3.6    tidyverse_1.3.1  gutenbergr_0.2.1
## [13] quanteda_3.2.1
##
## loaded via a namespace (and not attached):
##  [1] httr_1.4.3         sass_0.4.1         bit64_4.0.5        vroom_1.5.7
##  [5] jsonlite_1.8.0     modelr_0.1.8       bslib_0.3.1        RcppParallel_5.1.5
##  [9] assertthat_0.2.1   highr_0.9          renv_0.15.4        cellranger_1.1.0
## [13] yaml_2.3.5         gdtools_0.2.4      pillar_1.7.0       backports_1.4.1
## [17] lattice_0.20-45    glue_1.6.2         uuid_1.1-0         digest_0.6.29
## [21] rvest_1.0.2        colorspace_2.0-3   htmltools_0.5.2    Matrix_1.4-1
## [25] pkgconfig_2.0.3    broom_0.8.0        haven_2.5.0        scales_1.2.0
## [29] officer_0.4.2      tzdb_0.3.0         generics_0.1.2     ellipsis_0.3.2
## [33] withr_2.5.0        lazyeval_0.2.2     klippy_0.0.0.9500  cli_3.3.0
## [37] magrittr_2.0.3     crayon_1.5.1       readxl_1.4.0       evaluate_0.15
## [41] stopwords_2.3      fs_1.5.2           fansi_1.0.3        xml2_1.3.3
## [45] tools_4.2.0        data.table_1.14.2  hms_1.1.1          lifecycle_1.0.1
## [49] munsell_0.5.0      reprex_2.0.1       zip_2.2.0          compiler_4.2.0
## [53] jquerylib_0.1.4    systemfonts_1.0.4  rlang_1.0.2        grid_4.2.0
## [57] rstudioapi_0.13    base64enc_0.1-3    rmarkdown_2.14     gtable_0.3.0
## [61] DBI_1.1.2          R6_2.5.1           lubridate_1.8.0    knitr_1.39
## [65] bit_4.0.4          fastmap_1.1.0      utf8_1.2.2         fastmatch_1.1-3
## [69] stringi_1.7.6      parallel_4.2.0     Rcpp_1.0.8.3       vctrs_0.4.1
## [73] dbplyr_2.1.1       tidyselect_1.1.2   xfun_0.30

