This tutorial introduces regular expressions and how they can be used when working with language data. Regular expressions are powerful tools used to search and manipulate text patterns. They provide a way to find specific sequences of characters within larger bodies of text. Think of them as search patterns on steroids. Regular expressions are useful for tasks like extracting specific words, finding patterns, or replacing text in bulk. They offer a concise and flexible way to describe complex text patterns using symbols and special characters. Regular expressions have applications in linguistics and humanities research, aiding in tasks such as text analysis, corpus linguistics, and language processing. Understanding regular expressions can unlock new possibilities for exploring and analyzing textual data.
This tutorial is aimed at beginners and intermediate users of R and showcases how to use regular expressions (or wildcards) in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with regular expressions.
To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the following R Basics tutorials:
Click here to download the entire R Notebook for this tutorial.
Click here to open an interactive Jupyter notebook that allows you to execute, change, and edit the code as well as to upload your own data.
How can you search texts for complex patterns or combinations of patterns? This question will be answered in this tutorial, and by the end you will be able to perform very complex searches yourself. The key concept of this tutorial is that of a regular expression. A regular expression (in short also called regex or regexp) is a special sequence of characters (or string) for describing a search pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids.
If you would like to dig deeper into regular expressions, I can recommend Friedl (2006) and, in particular, chapter 17 of Peng (2020) for further study (although the latter uses base R rather than tidyverse functions, this does not affect the utility of the discussion of regular expressions in any major or meaningful manner). Also, here is a cheatsheet about regular expressions written by Ian Kopacka and provided by RStudio. Nick Thieberger has also recorded a very nice Introduction to Regular Expressions for humanities scholars and uploaded it to YouTube.
Preparation and session set up
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries), so you do not need to worry if it takes a while.
# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages("tidyverse")
install.packages("flextable")
install.packages("htmlwidgets")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
Next, we load the packages.
library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed RStudio and have initiated the session by executing the code shown above, you are good to go.
To put regular expressions into practice, we need some text to perform our searches on. In this tutorial, we will use texts about grammar from Wikipedia.
# read in first text
text1 <- readLines("https://slcladal.github.io/data/testcorpus/linguistics02.txt")
et <- paste(text1, sep = " ", collapse = " ")
# inspect example text
et
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."
In addition, we will split the example text into words to have another resource we can use to understand regular expressions.
# split example text
set <- str_split(et, " ") %>%
  unlist()
# inspect
head(set)
## [1] "Grammar" "is" "a" "system" "of" "rules"
Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.
There are three basic types of regular expressions:
regular expressions that stand for individual symbols and determine frequencies
regular expressions that stand for classes of symbols
regular expressions that stand for structural properties
The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.
RegEx Symbol/Sequence | Explanation | Example |
---|---|---|
? | The preceding item is optional and will be matched at most once | walk[a-z]? = walk, walks |
* | The preceding item will be matched zero or more times | walk[a-z]* = walk, walks, walked, walking |
+ | The preceding item will be matched one or more times | walk[a-z]+ = walks, walked, walking |
{n} | The preceding item is matched exactly n times | walk[a-z]{2} = walked |
{n,} | The preceding item is matched n or more times | walk[a-z]{2,} = walked, walking |
{n,m} | The preceding item is matched at least n times, but not more than m times | walk[a-z]{2,3} = walked, walking |
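To get a first feel for these quantifiers, here is a minimal sketch using the str_detect function from the stringr package (loaded above as part of the tidyverse); the example strings are purely illustrative.
# sketch: quantifiers in action (the example strings are purely illustrative)
str_detect(c("color", "colour"), "colou?r")               # u is optional: TRUE TRUE
str_detect(c("walk", "walked"), "walk[a-z]+")             # one or more letters after walk: FALSE TRUE
str_detect(c("gogle", "google", "gooogle"), "go{2,}gle")  # at least two o's: FALSE TRUE TRUE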
The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.
RegEx Symbol/Sequence | Explanation |
---|---|
[ab] | lower case a and b |
[a-z] | all lower case characters from a to z |
[AB] | upper case A and B |
[A-Z] | all upper case characters from A to Z |
[12] | digits 1 and 2 |
[0-9] | digits: 0 1 2 3 4 5 6 7 8 9 |
[:digit:] | digits: 0 1 2 3 4 5 6 7 8 9 |
[:lower:] | lower case characters: a–z |
[:upper:] | upper case characters: A–Z |
[:alpha:] | alphabetic characters: a–z and A–Z |
[:alnum:] | digits and alphabetic characters |
[:punct:] | punctuation characters: . , ; etc. |
[:graph:] | graphical characters: [:alnum:] and [:punct:] |
[:blank:] | blank characters: Space and tab |
[:space:] | space characters: Space, tab, newline, and other space characters |
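As a brief illustration of these symbol classes, the following minimal sketch tests a few strings for digits, upper case letters, and punctuation; again, the example strings are purely illustrative.
# sketch: symbol classes in action (the example strings are purely illustrative)
str_detect(c("walk", "walk2"), "[0-9]")      # contains a digit: FALSE TRUE
str_detect(c("walk", "Walk"), "[A-Z]")       # contains an upper case letter: FALSE TRUE
str_detect(c("walk", "walk!"), "[:punct:]")  # contains a punctuation character: FALSE TRUE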
The regular expressions that denote classes of symbols are enclosed in [: and :] (e.g., [:digit:]). The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.
RegEx Symbol/Sequence | Explanation |
---|---|
\\w | Word characters: [[:alnum:]_] |
\\W | No word characters: [^[:alnum:]_] |
\\s | Space characters: [[:space:]] |
\\S | No space characters: [^[:space:]] |
\\d | Digits: [[:digit:]] |
\\D | No digits: [^[:digit:]] |
\\b | Word edge |
\\B | No word edge |
\\< | Word beginning |
\\> | Word end |
^ | Beginning of a string |
$ | End of a string |
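A minimal sketch may also help to clarify how these structural expressions work; the example strings below are purely illustrative (note that the double backslash is needed because the backslash itself has to be escaped in R character strings).
# sketch: structural regular expressions in action (the example strings are purely illustrative)
str_detect(c("grammar", "generative grammar"), "^grammar$")  # entire string is "grammar": TRUE FALSE
str_detect(c("chapter 17", "chapter one"), "\\d")            # contains a digit: TRUE FALSE
str_detect(c("walks", "walks fast"), "\\bfast\\b")           # contains the word "fast": FALSE TRUE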
In this section, we will explore how to use regular expressions. At the end, we will go through some exercises to help you understand how you can best utilize regular expressions.
Show all words in the split example text that contain a or n.
set[str_detect(set, "[an]")]
## [1] "Grammar" "a" "governs" "production" "and"
## [6] "utterances" "in" "a" "given" "language."
## [11] "apply" "sound" "as" "as" "meaning,"
## [16] "and" "include" "componential" "as" "pertaining"
## [21] "phonology" "organisation" "phonetic" "sound" "formation"
## [26] "and" "composition" "and" "syntax" "formation"
## [31] "and" "composition" "phrases" "and" "sentences)."
## [36] "Many" "modern" "that" "deal" "principles"
## [41] "grammar" "are" "based" "on" "Noam"
## [46] "framework" "generative" "linguistics."
Show all words in the split example text that begin with a lower case a.
set[str_detect(set, "^a")]
## [1] "a" "and" "a" "apply" "as" "as" "and" "as" "and"
## [10] "and" "and" "and" "are"
Show all words in the split example text that end in a lower case s.
set[str_detect(set, "s$")]
## [1] "is" "rules" "governs" "utterances" "rules"
## [6] "as" "as" "subsets" "as" "phrases"
## [11] "theories" "principles" "Chomsky's"
Show all words in the split example text in which there is an e, then any other character, and then an n.
set[str_detect(set, "e.n")]
## [1] "governs" "meaning," "modern"
Show all words in the split example text in which there is an e, then two other characters, and then an n.
set[str_detect(set, "e.{2,2}n")]
## [1] "utterances"
Show all words that consist of exactly three alphabetical characters in the split example text.
set[str_detect(set, "^[:alpha:]{3,3}$")]
## [1] "the" "and" "use" "and" "and" "and" "and" "and" "the" "are"
Show all words that consist of six or more alphabetical characters in the split example text.
set[str_detect(set, "^[:alpha:]{6,}$")]
## [1] "Grammar" "system" "governs" "production" "utterances"
## [6] "include" "componential" "subsets" "pertaining" "phonology"
## [11] "organisation" "phonetic" "morphology" "formation" "composition"
## [16] "syntax" "formation" "composition" "phrases" "modern"
## [21] "theories" "principles" "grammar" "framework" "generative"
Replace all lower case a's with upper case E's in the example text.
str_replace_all(et, "a", "E")
## [1] "GrEmmEr is E system of rules which governs the production End use of utterEnces in E given lEnguEge. These rules Epply to sound Es well Es meEning, End include componentiEl subsets of rules, such Es those pertEining to phonology (the orgEnisEtion of phonetic sound systems), morphology (the formEtion End composition of words), End syntEx (the formEtion End composition of phrEses End sentences). MEny modern theories thEt deEl with the principles of grEmmEr Ere bEsed on NoEm Chomsky's frEmework of generEtive linguistics."
Remove all non-word characters (i.e. everything that is not a letter, digit, or underscore) from the split example text.
str_remove_all(set, "\\W")
## [1] "Grammar" "is" "a" "system" "of"
## [6] "rules" "which" "governs" "the" "production"
## [11] "and" "use" "of" "utterances" "in"
## [16] "a" "given" "language" "These" "rules"
## [21] "apply" "to" "sound" "as" "well"
## [26] "as" "meaning" "and" "include" "componential"
## [31] "subsets" "of" "rules" "such" "as"
## [36] "those" "pertaining" "to" "phonology" "the"
## [41] "organisation" "of" "phonetic" "sound" "systems"
## [46] "morphology" "the" "formation" "and" "composition"
## [51] "of" "words" "and" "syntax" "the"
## [56] "formation" "and" "composition" "of" "phrases"
## [61] "and" "sentences" "Many" "modern" "theories"
## [66] "that" "deal" "with" "the" "principles"
## [71] "of" "grammar" "are" "based" "on"
## [76] "Noam" "Chomskys" "framework" "of" "generative"
## [81] "linguistics"
Remove all white spaces in the example text.
str_remove_all(et, " ")
## [1] "Grammarisasystemofruleswhichgovernstheproductionanduseofutterancesinagivenlanguage.Theserulesapplytosoundaswellasmeaning,andincludecomponentialsubsetsofrules,suchasthosepertainingtophonology(theorganisationofphoneticsoundsystems),morphology(theformationandcompositionofwords),andsyntax(theformationandcompositionofphrasesandsentences).ManymoderntheoriesthatdealwiththeprinciplesofgrammararebasedonNoamChomsky'sframeworkofgenerativelinguistics."
Highlighting patterns
We use the str_view and str_view_all functions to show the occurrences of regular expressions in the example text.
To begin with, we match an exactly defined pattern (ang).
str_view_all(et, "ang")
## [1] │ Grammar is a system of rules which governs the production and use of utterances in a given l<ang>uage. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.
Now, we include . which stands for any symbol (except a new line symbol).
str_view_all(et, ".n.")
## [1] │ Grammar is a system of rules which gove<rns> the producti<on ><and> use of utter<anc>es <in >a giv<en >l<ang>uage. These rules apply to so<und> as well as me<ani>ng, <and> <inc>lude comp<one>ntial subsets of rules, such as those perta<ini>ng to ph<ono>logy (the org<ani>sati<on >of ph<one>tic so<und> systems), morphology (the formati<on ><and> compositi<on >of words), <and> s<ynt>ax (the formati<on ><and> compositi<on >of phrases <and> s<ent><enc>es). M<any> mode<rn >theories that deal with the pr<inc>iples of grammar are based <on >Noam Chomsky's framework of g<ene>rative l<ing>uistics.
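The patterns passed to str_view_all can combine any of the elements introduced above. As a minimal sketch, the following call would highlight all capitalized words in the example text, i.e. every sequence of an upper case letter followed by one or more lower case letters.
# sketch: highlight capitalized words (an upper case letter followed by one or more lower case letters)
str_view_all(et, "[A-Z][a-z]+")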
EXERCISE TIME!

More exercises will follow - bear with us ;)
Schweinberger, Martin. 2022. Regular Expressions in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/regex.html (Version 2022.11.17).
@manual{schweinberger2022regex,
author = {Schweinberger, Martin},
title = {Regular Expressions in R},
note = {https://ladal.edu.au/regex.html},
year = {2022},
organization = {The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2022.11.17}
}
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] flextable_0.9.1 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [5] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
## [9] tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 assertthat_0.2.1 digest_0.6.31
## [4] utf8_1.2.3 mime_0.12 R6_2.5.1
## [7] evaluate_0.21 highr_0.10 pillar_1.9.0
## [10] gdtools_0.3.3 rlang_1.1.1 uuid_1.1-0
## [13] curl_5.0.0 rstudioapi_0.14 data.table_1.14.8
## [16] jquerylib_0.1.4 klippy_0.0.0.9500 rmarkdown_2.21
## [19] textshaping_0.3.6 munsell_0.5.0 shiny_1.7.4
## [22] compiler_4.2.2 httpuv_1.6.11 xfun_0.39
## [25] askpass_1.1 pkgconfig_2.0.3 systemfonts_1.0.4
## [28] gfonts_0.2.0 htmltools_0.5.5 openssl_2.0.6
## [31] tidyselect_1.2.0 fontBitstreamVera_0.1.1 httpcode_0.3.0
## [34] fansi_1.0.4 crayon_1.5.2 tzdb_0.4.0
## [37] withr_2.5.0 later_1.3.1 crul_1.4.0
## [40] grid_4.2.2 jsonlite_1.8.4 xtable_1.8-4
## [43] gtable_0.3.3 lifecycle_1.0.3 magrittr_2.0.3
## [46] scales_1.2.1 zip_2.3.0 cli_3.6.1
## [49] stringi_1.7.12 cachem_1.0.8 promises_1.2.0.1
## [52] xml2_1.3.4 bslib_0.4.2 ragg_1.2.5
## [55] ellipsis_0.3.2 generics_0.1.3 vctrs_0.6.2
## [58] tools_4.2.2 glue_1.6.2 officer_0.6.2
## [61] fontquiver_0.2.1 hms_1.1.3 fastmap_1.1.1
## [64] yaml_2.3.7 timechange_0.2.0 colorspace_2.1-0
## [67] fontLiberation_0.1.0 knitr_1.43 sass_0.4.6
If you want to render the R Notebook on your machine, i.e. knit the document to HTML or a PDF, you need to make sure that you have R and RStudio installed, and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.