1 Introduction

This tutorial introduces regular expressions and how they can be used when working with language data. The entire R Notebook for the sections below can be downloaded here.

This tutorial is aimed at beginners and intermediate users of R and showcases how to use regular expressions (or wildcards) in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with regular expressions.

If you want to render the R Notebook on your machine, i.e. knit the document to HTML or PDF, make sure that you have R and RStudio installed. You also need to download the bibliography file and store it in the same folder as the Rmd file.

Here is a link to an interactive version of this tutorial on Google Colab. The interactive tutorial is based on a Jupyter notebook of this tutorial. This interactive Jupyter notebook allows you to execute code yourself and - if you copy the Jupyter notebook - you can also change and edit the notebook, e.g. you can change code and upload your own data.

How can you search texts for complex patterns or combinations of patterns? This question will be answered in this tutorial, and at the end you will be able to perform very complex searches yourself. The key concept of this tutorial is that of a regular expression. A regular expression (in short also called regex or regexp) is a special sequence of characters (or string) for describing a search pattern. You can think of regular expressions as very powerful combinations of wildcards or as wildcards on steroids.

If you would like to dig deeper into regular expressions, I recommend Friedl (2006) and, in particular, chapter 17 of Peng (2020). The latter uses base R rather than tidyverse functions, but this does not affect the usefulness of its discussion of regular expressions. In addition, here is a cheat sheet on regular expressions written by Ian Kopacka and provided by RStudio. Nick Thieberger has also recorded a very nice Introduction to Regular Expressions for humanities scholars on YouTube.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages mentioned below, you can skip this section. Otherwise, run the following code. Installation may take between 1 and 5 minutes, so there is no need to worry if it takes some time.

# set options
options(stringsAsFactors = FALSE)     # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages("tidyverse")
install.packages("flextable")
install.packages("htmlwidgets")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

In a next step, we load the packages.

library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed RStudio and have initiated the session by executing the code shown above, you are good to go.

2 Getting started with Regular Expressions

To put regular expressions into practice, we need some text to perform our searches on. In this tutorial, we will use texts from Wikipedia about grammar.

# read in first text
text1 <- readLines("https://slcladal.github.io/data/testcorpus/linguistics02.txt")
et <- paste(text1, collapse = " ")
# inspect example text
et
## [1] "Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics."

In addition, we will split the example text into words to have another resource we can use to understand regular expressions.

# split example text
set <- str_split(et, " ") %>%
  unlist()
# inspect
head(set)
## [1] "Grammar" "is"      "a"       "system"  "of"      "rules"

Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.

There are three basic types of regular expressions:

  • regular expressions that stand for individual symbols and determine frequencies

  • regular expressions that stand for classes of symbols

  • regular expressions that stand for structural properties

The regular expressions below show the first type of regular expressions, i.e. regular expressions that stand for individual symbols and determine frequencies.
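
As a minimal sketch of how frequency-determining expressions (quantifiers) behave, the following example applies the most common ones to a made-up vector. The vector toy and the result names are assumptions for illustration and are not part of the tutorial data.

```r
library(stringr)

# hypothetical toy vector (not part of the tutorial data)
toy <- c("wlk", "walk", "waalk", "waaalk")

# ?     : preceding item is optional (zero or one time)
q_optional <- toy[str_detect(toy, "^wa?lk$")]
# *     : preceding item occurs zero or more times
q_star <- toy[str_detect(toy, "^wa*lk$")]
# +     : preceding item occurs one or more times
q_plus <- toy[str_detect(toy, "^wa+lk$")]
# {n}   : preceding item occurs exactly n times
q_exact <- toy[str_detect(toy, "^wa{2}lk$")]
# {n,}  : preceding item occurs n or more times
q_min <- toy[str_detect(toy, "^wa{2,}lk$")]
# {n,m} : preceding item occurs between n and m times
q_range <- toy[str_detect(toy, "^wa{1,2}lk$")]

q_optional  # "wlk"  "walk"
q_plus      # "walk" "waalk" "waaalk"
```

Each quantifier applies only to the item directly before it (here the a), which is why the surrounding w and lk must always be present.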

The regular expressions below show the second type of regular expressions, i.e. regular expressions that stand for classes of symbols.
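
To illustrate a few classes of symbols, here is a minimal sketch on a made-up vector (toy and the result names are assumptions, not part of the tutorial data). It uses the double-bracket form such as [[:alpha:]], which should also work in base R. The shorter [:alpha:] form used later in this tutorial relies on the regex engine behind stringr.

```r
library(stringr)

# hypothetical toy vector (not part of the tutorial data)
toy <- c("Grammar", "rules", "2022", "a-b", "R2D2")

# [[:alpha:]] : letters only
only_letters <- toy[str_detect(toy, "^[[:alpha:]]+$")]
# [[:digit:]] : digits only
only_digits <- toy[str_detect(toy, "^[[:digit:]]+$")]
# [[:alnum:]] : letters and digits
only_alnum <- toy[str_detect(toy, "^[[:alnum:]]+$")]
# [[:punct:]] : contains at least one punctuation character
has_punct <- toy[str_detect(toy, "[[:punct:]]")]
# [[:upper:]] : begins with an upper case letter
starts_upper <- toy[str_detect(toy, "^[[:upper:]]")]

only_letters  # "Grammar" "rules"
starts_upper  # "Grammar" "R2D2"
```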

The regular expressions that denote classes of symbols are enclosed in [] and :. The last type of regular expressions, i.e. regular expressions that stand for structural properties, is shown below.
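
A few commonly used structural expressions can be sketched as follows. The sentence and the result names are made up for illustration. Note that in R strings the backslash itself has to be escaped, so the regular expression \b is written "\\b".

```r
library(stringr)

# hypothetical example sentence (not part of the tutorial data)
sentence <- "The cat sat on the mat in 2022."

# ^ anchors the pattern at the beginning of the string
starts_with_the <- str_detect(sentence, "^The")
# $ anchors the pattern at the end of the string (\\. is a literal full stop)
ends_with_dot <- str_detect(sentence, "\\.$")
# \\b marks a word boundary
at_words <- str_extract_all(sentence, "\\b[a-z]at\\b")[[1]]
# \\d stands for a digit
numbers <- str_extract_all(sentence, "\\d+")[[1]]
# \\w stands for a word character (letter, digit, or underscore)
n_words <- str_count(sentence, "\\w+")
# \\s stands for a white space character
n_spaces <- str_count(sentence, "\\s")

at_words  # "cat" "sat" "mat"
n_words   # 8
```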

3 Practice

In this section, we will explore how to use regular expressions. At the end, we will go through some exercises to help you understand how you can best utilize regular expressions.

Show all words in the split example text that contain a or n.

set[str_detect(set, "[an]")]
##  [1] "Grammar"      "a"            "governs"      "production"   "and"         
##  [6] "utterances"   "in"           "a"            "given"        "language."   
## [11] "apply"        "sound"        "as"           "as"           "meaning,"    
## [16] "and"          "include"      "componential" "as"           "pertaining"  
## [21] "phonology"    "organisation" "phonetic"     "sound"        "formation"   
## [26] "and"          "composition"  "and"          "syntax"       "formation"   
## [31] "and"          "composition"  "phrases"      "and"          "sentences)." 
## [36] "Many"         "modern"       "that"         "deal"         "principles"  
## [41] "grammar"      "are"          "based"        "on"           "Noam"        
## [46] "framework"    "generative"   "linguistics."
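
Square brackets can also hold ranges and negated sets. The following sketch shows this on a made-up vector (words and the result names are assumptions for illustration, not part of the tutorial data).

```r
library(stringr)

# hypothetical toy vector (not part of the tutorial data)
words <- c("and", "the", "of", "in")

# [an]  : at least one a or n somewhere in the word
with_a_or_n <- words[str_detect(words, "[an]")]
# [a-f] : at least one character in the range a to f
with_a_to_f <- words[str_detect(words, "[a-f]")]
# [^an] : any character other than a or n; anchored with ^ and $ and
#         repeated with +, it keeps only words containing neither a nor n
without_a_or_n <- words[str_detect(words, "^[^an]+$")]

with_a_or_n     # "and" "in"
without_a_or_n  # "the" "of"
```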

Show all words in the split example text that begin with a lower case a.

set[str_detect(set, "^a")]
##  [1] "a"     "and"   "a"     "apply" "as"    "as"    "and"   "as"    "and"  
## [10] "and"   "and"   "and"   "are"

Show all words in the split example text that end in a lower case s.

set[str_detect(set, "s$")]
##  [1] "is"         "rules"      "governs"    "utterances" "rules"     
##  [6] "as"         "as"         "subsets"    "as"         "phrases"   
## [11] "theories"   "principles" "Chomsky's"

Show all words in the split example text in which there is an e, then any other character, and then an n.

set[str_detect(set, "e.n")]
## [1] "governs"  "meaning," "modern"

Show all words in the split example text in which there is an e, then two other characters, and then an n.

set[str_detect(set, "e.{2,2}n")]
## [1] "utterances"
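
The quantifier {2,2} (at least two and at most two) can be abbreviated to {2}. A minimal sketch on a few words taken from the example text (the vector words and the result names are introduced here for illustration only):

```r
library(stringr)

# a few words taken from the example text
words <- c("utterances", "meaning,", "modern", "governs")

# e, exactly one arbitrary character, n
one_between <- words[str_detect(words, "e.n")]
# e, exactly two arbitrary characters, n (long and short notation)
two_between_long <- words[str_detect(words, "e.{2,2}n")]
two_between_short <- words[str_detect(words, "e.{2}n")]

one_between        # "meaning," "modern" "governs"
two_between_short  # "utterances"
```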

Show all words that consist of exactly three alphabetical characters in the split example text.

set[str_detect(set, "^[:alpha:]{3,3}$")]
##  [1] "the" "and" "use" "and" "and" "and" "and" "and" "the" "are"

Show all words that consist of six or more alphabetical characters in the split example text.

set[str_detect(set, "^[:alpha:]{6,}$")]
##  [1] "Grammar"      "system"       "governs"      "production"   "utterances"  
##  [6] "include"      "componential" "subsets"      "pertaining"   "phonology"   
## [11] "organisation" "phonetic"     "morphology"   "formation"    "composition" 
## [16] "syntax"       "formation"    "composition"  "phrases"      "modern"      
## [21] "theories"     "principles"   "grammar"      "framework"    "generative"
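
Because the pattern is anchored with ^ and $, words with attached punctuation such as "language." are not matched. If such words should be included, one option is to strip punctuation first. This is only a sketch. The vector words and the names strict, cleaned, and relaxed are made up for illustration.

```r
library(stringr)

# a few tokens as they occur in the split example text, plus a short word
words <- c("language.", "meaning,", "system", "use")

# anchored pattern on the raw tokens: punctuation blocks the match
strict <- words[str_detect(words, "^[[:alpha:]]{6,}$")]

# remove punctuation first, then apply the same pattern
cleaned <- str_remove_all(words, "[[:punct:]]")
relaxed <- cleaned[str_detect(cleaned, "^[[:alpha:]]{6,}$")]

strict   # "system"
relaxed  # "language" "meaning" "system"
```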

Replace all lower case as with upper case Es in the example text.

str_replace_all(et, "a", "E")
## [1] "GrEmmEr is E system of rules which governs the production End use of utterEnces in E given lEnguEge. These rules Epply to sound Es well Es meEning, End include componentiEl subsets of rules, such Es those pertEining to phonology (the orgEnisEtion of phonetic sound systems), morphology (the formEtion End composition of words), End syntEx (the formEtion End composition of phrEses End sentences). MEny modern theories thEt deEl with the principles of grEmmEr Ere bEsed on NoEm Chomsky's frEmework of generEtive linguistics."
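
Note the difference between str_replace_all and str_replace: the latter replaces only the first match in each string. A minimal sketch with a made-up string (x is not part of the tutorial data):

```r
library(stringr)

# hypothetical example string (not part of the tutorial data)
x <- "a banana"

# str_replace changes only the first match
first_only <- str_replace(x, "a", "E")
# str_replace_all changes every match
all_matches <- str_replace_all(x, "a", "E")

first_only   # "E banana"
all_matches  # "E bEnEnE"
```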

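Remove all non-alphabetical characters in the split example text. Strictly speaking, \\W matches any character that is not a letter, digit, or underscore; for this text, the result is the same.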

str_remove_all(set, "\\W")
##  [1] "Grammar"      "is"           "a"            "system"       "of"          
##  [6] "rules"        "which"        "governs"      "the"          "production"  
## [11] "and"          "use"          "of"           "utterances"   "in"          
## [16] "a"            "given"        "language"     "These"        "rules"       
## [21] "apply"        "to"           "sound"        "as"           "well"        
## [26] "as"           "meaning"      "and"          "include"      "componential"
## [31] "subsets"      "of"           "rules"        "such"         "as"          
## [36] "those"        "pertaining"   "to"           "phonology"    "the"         
## [41] "organisation" "of"           "phonetic"     "sound"        "systems"     
## [46] "morphology"   "the"          "formation"    "and"          "composition" 
## [51] "of"           "words"        "and"          "syntax"       "the"         
## [56] "formation"    "and"          "composition"  "of"           "phrases"     
## [61] "and"          "sentences"    "Many"         "modern"       "theories"    
## [66] "that"         "deal"         "with"         "the"          "principles"  
## [71] "of"           "grammar"      "are"          "based"        "on"          
## [76] "Noam"         "Chomskys"     "framework"    "of"           "generative"  
## [81] "linguistics"
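
The difference between \\W and a strictly alphabetical filter only shows up when the data contain digits or underscores. The following sketch uses a made-up vector (x and the result names are assumptions for illustration) to compare the two.

```r
library(stringr)

# hypothetical toy vector (not part of the tutorial data)
x <- c("Chomsky's", "R2D2", "snake_case", "sentences).")

# \\W removes everything except letters, digits, and underscores
non_word_removed <- str_remove_all(x, "\\W")
# [^[:alpha:]] removes everything except letters
non_alpha_removed <- str_remove_all(x, "[^[:alpha:]]")

non_word_removed   # "Chomskys" "R2D2" "snake_case" "sentences"
non_alpha_removed  # "Chomskys" "RD" "snakecase" "sentences"
```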

Remove all white spaces in the example text.

str_remove_all(et, " ")
## [1] "Grammarisasystemofruleswhichgovernstheproductionanduseofutterancesinagivenlanguage.Theserulesapplytosoundaswellasmeaning,andincludecomponentialsubsetsofrules,suchasthosepertainingtophonology(theorganisationofphoneticsoundsystems),morphology(theformationandcompositionofwords),andsyntax(theformationandcompositionofphrasesandsentences).ManymoderntheoriesthatdealwiththeprinciplesofgrammararebasedonNoamChomsky'sframeworkofgenerativelinguistics."
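
The pattern " " removes only ordinary spaces. If a text also contains tabs or line breaks, \\s covers all of them, and str_squish collapses runs of white space instead of deleting them. A minimal sketch with a made-up string (x is not part of the tutorial data):

```r
library(stringr)

# hypothetical string with a double space, a tab, and a line break
x <- "Grammar  is\ta system\nof rules"

# " " removes ordinary spaces only; the tab and the line break remain
spaces_only <- str_remove_all(x, " ")
# \\s removes every white space character
all_whitespace <- str_remove_all(x, "\\s")
# str_squish trims the string and reduces white space runs to one space
squished <- str_squish(x)

all_whitespace  # "Grammarisasystemofrules"
squished        # "Grammar is a system of rules"
```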

Highlighting patterns

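We use the str_view and str_view_all functions to show the occurrences of regular expressions in the example text. (In stringr 1.5.0 and later, str_view_all is deprecated because str_view highlights all matches by default.)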

To begin with, we match an exactly defined pattern (ang).

str_view_all(et, "ang")

Now, we include . which stands for any symbol (except a new line symbol).

str_view_all(et, ".n.")
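
The highlighting functions are meant for interactive inspection. If you want to work with the matches themselves, the same patterns can be passed to str_extract_all, str_locate_all, or str_count. The following is only a sketch on a short made-up string (txt is not part of the tutorial data).

```r
library(stringr)

# hypothetical short string (not part of the tutorial data)
txt <- "a given language"

# return the matched substrings
ang <- str_extract_all(txt, "ang")[[1]]
n_context <- str_extract_all(txt, ".n.")[[1]]
# return start and end positions of each match
pos <- str_locate_all(txt, "ang")[[1]]
# count the matches
n_ang <- str_count(txt, "ang")

n_context  # "en " "ang"
pos        # start 10, end 12
```

Matches do not overlap: once "en " has been consumed, the search continues after it.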

EXERCISE TIME!


  1. What regular expression can you use to extract all forms of walk from a text?
     Answer: [Ww][Aa][Ll][Kk].*
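
To see how the answer behaves, here is a minimal sketch on made-up data (words and sentence are assumptions, not part of the tutorial data). It also shows one possible alternative that combines a word boundary with case-insensitive matching. This alternative is a suggestion, not the tutorial's own solution.

```r
library(stringr)

# hypothetical example data (not part of the tutorial data)
words <- c("walk", "Walked", "WALKING", "walks", "talk", "sidewalk")
sentence <- "She Walked home and he walks on the sidewalk."

# the answer pattern flags every word containing walk in any casing,
# including words such as sidewalk
detected <- words[str_detect(words, "[Ww][Aa][Ll][Kk].*")]

# in running text, the greedy .* extends the match to the end of the line
greedy <- str_extract_all(sentence, "[Ww][Aa][Ll][Kk].*")[[1]]

# alternative sketch: \\b requires walk at the start of a word, \\w* adds
# the rest of the word, and ignore_case = TRUE covers all casings
walk_forms <- str_extract_all(
  sentence,
  regex("\\bwalk\\w*", ignore_case = TRUE)
)[[1]]

detected    # "walk" "Walked" "WALKING" "walks" "sidewalk"
walk_forms  # "Walked" "walks"
```

Which variant is preferable depends on the data: the answer pattern works well on a vector of single words, while the boundary-based pattern is better suited to extracting word forms from running text.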

More exercises will follow - bear with us ;)



Citation & Session Info

Schweinberger, Martin. 2022. Regular Expressions in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/regex.html (Version 2022.07.30).

@manual{schweinberger2022regex,
  author = {Schweinberger, Martin},
  title = {Regular Expressions in R},
  note = {https://slcladal.github.io/regex.html},
  year = {2022},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.07.30}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] flextable_0.7.0 forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9    
##  [5] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.7   
##  [9] ggplot2_3.3.6   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3      lubridate_1.8.0   assertthat_0.2.1  digest_0.6.29    
##  [5] utf8_1.2.2        R6_2.5.1          cellranger_1.1.0  backports_1.4.1  
##  [9] reprex_2.0.1      evaluate_0.15     httr_1.4.3        highr_0.9        
## [13] pillar_1.7.0      gdtools_0.2.4     rlang_1.0.2       uuid_1.1-0       
## [17] readxl_1.4.0      rstudioapi_0.13   data.table_1.14.2 jquerylib_0.1.4  
## [21] klippy_0.0.0.9500 rmarkdown_2.14    htmlwidgets_1.5.4 munsell_0.5.0    
## [25] broom_0.8.0       compiler_4.2.1    modelr_0.1.8      xfun_0.30        
## [29] systemfonts_1.0.4 pkgconfig_2.0.3   base64enc_0.1-3   htmltools_0.5.2  
## [33] tidyselect_1.1.2  fansi_1.0.3       crayon_1.5.1      tzdb_0.3.0       
## [37] dbplyr_2.1.1      withr_2.5.0       grid_4.2.1        jsonlite_1.8.0   
## [41] gtable_0.3.0      lifecycle_1.0.1   DBI_1.1.2         magrittr_2.0.3   
## [45] scales_1.2.0      zip_2.2.0         cli_3.3.0         stringi_1.7.6    
## [49] renv_0.15.4       fs_1.5.2          xml2_1.3.3        bslib_0.3.1      
## [53] ellipsis_0.3.2    generics_0.1.2    vctrs_0.4.1       tools_4.2.1      
## [57] glue_1.6.2        officer_0.4.2     hms_1.1.1         fastmap_1.1.0    
## [61] yaml_2.3.5        colorspace_2.0-3  rvest_1.0.2       knitr_1.39       
## [65] haven_2.5.0       sass_0.4.1



References

Friedl, Jeffrey E. F. 2006. Mastering Regular Expressions. Sebastopol, CA: O’Reilly Media.
Peng, Roger D. 2020. R Programming for Data Science. Leanpub. https://bookdown.org/rdpeng/rprogdatascience/.