Introduction

This tutorial shows how to download and clean works from the Project Gutenberg archive using R. Project Gutenberg is a database containing roughly 60,000 texts for which the US copyright has expired. The entire R Markdown document for the sections below can be downloaded here.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed these packages, you can skip this section. Otherwise, run the following code; installing all of the packages may take between 1 and 5 minutes, so there is no need to worry if it takes some time.

# install libraries
install.packages("tidyverse")
install.packages("gutenbergr")
install.packages("DT")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
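If some of these packages are already present, reinstalling them is unnecessary. The following is a minimal sketch, assuming the CRAN package names used above, that reports which packages are still missing and installs only those. `missing_packages` is a hypothetical helper name, and klippy is left out because it is installed from GitHub rather than from CRAN.

```r
# hypothetical helper: return the packages from `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# CRAN packages used in this tutorial
needed <- c("tidyverse", "gutenbergr", "DT", "flextable", "remotes")
to_install <- missing_packages(needed)
print(to_install)

# install only what is missing; the interactive() guard keeps
# non-interactive runs (e.g. Rscript) from triggering downloads
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```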

Now that we have installed the packages, we activate them as shown below.

# set options
options(stringsAsFactors = FALSE)     # do not convert strings to factors automatically
options("scipen" = 100, "digits" = 4) # suppress math annotation
# activate packages
library(tidyverse)
library(gutenbergr)
library(DT)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

1 Project Gutenberg

In a first step, we inspect which works are available for download. We can do this by typing gutenberg_metadata into the console, which outputs a table containing all available texts.

gutenberg_metadata

The overview table currently lists 51,997 texts, so we limit the output here to the first 15 rows.
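A 15-row preview of the overview table can be produced along the following lines. This is a sketch that assumes gutenbergr is installed; the `requireNamespace` guard simply skips the example otherwise, and `overview` is a name chosen here for illustration.

```r
# only run if gutenbergr is available
if (requireNamespace("gutenbergr", quietly = TRUE)) {
  # the metadata table ships with the package, so no download is needed
  overview <- head(gutenbergr::gutenberg_metadata, 15)
  print(overview)
  # total number of entries in the metadata table
  print(nrow(gutenbergr::gutenberg_metadata))
}
```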

To find all works by a specific author, you need to specify the author in the gutenberg_works function as shown below.

# load data
darwin <- gutenberg_works(author == "Darwin, Charles")
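The exact-match filter above requires the author string to be spelled exactly as it appears in the metadata ("Darwin, Charles"). Because gutenberg_works accepts filter expressions on the metadata columns, a partial match is also possible. The following is a sketch, assuming gutenbergr is installed; `darwin_like` is an illustrative name, and base R's `grepl` is used here for the matching.

```r
# only run if gutenbergr is available
if (requireNamespace("gutenbergr", quietly = TRUE)) {
  # all works whose author field contains "Darwin", not only Charles Darwin
  darwin_like <- gutenbergr::gutenberg_works(grepl("Darwin", author, fixed = TRUE))
  # inspect IDs, titles, and authors of the matches
  print(darwin_like[, c("gutenberg_id", "title", "author")])
}
```

The gutenberg_id values shown in the result can then be used to download individual works.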

To find all texts in, for example, German, you need to specify the language in the gutenberg_works function as shown below.

# count German-language works
gutenberg_works(languages = "de", all_languages = TRUE) %>%
  dplyr::count(language)
## # A tibble: 1 × 2
##   language     n
##   <chr>    <int>
## 1 de        1342
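The all_languages argument only makes a difference when several languages are given. The following is a sketch, assuming gutenbergr is installed and that the argument behaves as documented (works matching any of the listed languages by default, works containing all of them when all_languages = TRUE); `either` and `both` are illustrative names.

```r
# only run if gutenbergr is available
if (requireNamespace("gutenbergr", quietly = TRUE)) {
  # works in English or German (default: any listed language matches)
  either <- gutenbergr::gutenberg_works(languages = c("en", "de"))
  # works that contain both English and German
  both <- gutenbergr::gutenberg_works(languages = c("en", "de"),
                                      all_languages = TRUE)
  # compare the sizes of the two result sets
  print(c(either = nrow(either), both = nrow(both)))
}
```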

2 Loading individual texts

To download any of these texts, you need to specify the text you want, e.g. by its title. In a next step, you can then use the gutenberg_download function to download the text. To exemplify how this works, we download William Shakespeare’s Romeo and Juliet.

# load data
romeo <- gutenberg_works(title == "Romeo and Juliet") %>%
  gutenberg_download(meta_fields = "title")

We could also use the gutenberg_id to download this text.

# load data
romeo <- gutenberg_works(gutenberg_id == 1513) %>%
  gutenberg_download(meta_fields = "gutenberg_id")
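A downloaded text arrives as a table with one row per line of the book, which usually includes many empty lines. The following is a minimal cleaning sketch in base R; the small `romeo_toy` table is made-up stand-in data with the same gutenberg_id and text columns, so the example runs without a download.

```r
# made-up stand-in for a downloaded text (one row per line)
romeo_toy <- data.frame(
  gutenberg_id = 1513L,
  text = c("THE TRAGEDY OF ROMEO AND JULIET", "",
           "  by William Shakespeare", "", "THE PROLOGUE"),
  stringsAsFactors = FALSE
)

# strip leading and trailing whitespace, then drop empty lines
romeo_toy$text <- trimws(romeo_toy$text)
romeo_clean <- romeo_toy[romeo_toy$text != "", ]

# collapse the remaining lines into a single string
romeo_string <- paste(romeo_clean$text, collapse = " ")
print(romeo_string)
```

The same steps can be applied to the romeo object created above.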

3 Loading texts simultaneously

To load more than one text, you can pass a vector of gutenberg_id values to the gutenberg_download function. The code below downloads the text with the gutenberg_id 768 (Wuthering Heights) and the text with the gutenberg_id 1260 (Jane Eyre); the former is by Emily Brontë and the latter by Charlotte Brontë.1

texts <- gutenberg_download(c(768, 1260), meta_fields = "title", 
                            mirror = "http://mirrors.xmission.com/gutenberg/")
##                Text NumberOfLines
## 1 Wuthering Heights         12342
## 2         Jane Eyre         21001
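A per-text line count like the one shown above can be derived from the title column that meta_fields = "title" adds. The following is a sketch in base R on made-up stand-in data (`texts_toy`), so that it runs without a download. The column names Text and NumberOfLines mirror the output above, but the code is an assumption about how such a summary could be produced, not necessarily the code behind it.

```r
# made-up stand-in for two downloaded texts (one row per line)
texts_toy <- data.frame(
  gutenberg_id = c(768L, 768L, 768L, 1260L, 1260L),
  text = c("Wuthering Heights", "", "CHAPTER I", "Jane Eyre", "CHAPTER I"),
  title = c(rep("Wuthering Heights", 3), rep("Jane Eyre", 2)),
  stringsAsFactors = FALSE
)

# count rows (lines) per title
line_counts <- as.data.frame(table(texts_toy$title), stringsAsFactors = FALSE)
names(line_counts) <- c("Text", "NumberOfLines")
print(line_counts)
```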

Feel free to have a look at the other texts provided by Project Gutenberg!

Citation & Session Info

Schweinberger, Martin. 2022. Downloading Texts from Project Gutenberg using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/gutenberg.html (Version 2022.08.16).

@manual{schweinberger2022gb,
  author = {Schweinberger, Martin},
  title = {Downloading Texts from Project Gutenberg using R},
  note = {https://slcladal.github.io/gutenberg.html},
  year = {2022},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.08.16}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] flextable_0.7.0  DT_0.23          gutenbergr_0.2.1 forcats_0.5.1   
##  [5] stringr_1.4.0    dplyr_1.0.9      purrr_0.3.4      readr_2.1.2     
##  [9] tidyr_1.2.0      tibble_3.1.7     ggplot2_3.3.6    tidyverse_1.3.1 
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.2          lubridate_1.8.0   bit64_4.0.5       httr_1.4.3       
##  [5] tools_4.2.1       backports_1.4.1   bslib_0.3.1       utf8_1.2.2       
##  [9] R6_2.5.1          DBI_1.1.2         lazyeval_0.2.2    colorspace_2.0-3 
## [13] withr_2.5.0       tidyselect_1.1.2  bit_4.0.4         curl_4.3.2       
## [17] compiler_4.2.1    cli_3.3.0         rvest_1.0.2       xml2_1.3.3       
## [21] officer_0.4.2     triebeard_0.3.0   sass_0.4.1        scales_1.2.0     
## [25] systemfonts_1.0.4 digest_0.6.29     rmarkdown_2.14    base64enc_0.1-3  
## [29] pkgconfig_2.0.3   htmltools_0.5.2   dbplyr_2.1.1      fastmap_1.1.0    
## [33] highr_0.9         htmlwidgets_1.5.4 rlang_1.0.2       readxl_1.4.0     
## [37] rstudioapi_0.13   jquerylib_0.1.4   generics_0.1.2    jsonlite_1.8.0   
## [41] crosstalk_1.2.0   vroom_1.5.7       zip_2.2.0         magrittr_2.0.3   
## [45] Rcpp_1.0.8.3      munsell_0.5.0     fansi_1.0.3       gdtools_0.2.4    
## [49] lifecycle_1.0.1   stringi_1.7.6     yaml_2.3.5        grid_4.2.1       
## [53] parallel_4.2.1    crayon_1.5.1      haven_2.5.0       hms_1.1.1        
## [57] knitr_1.39        klippy_0.0.0.9500 pillar_1.7.0      uuid_1.1-0       
## [61] reprex_2.0.1      glue_1.6.2        evaluate_0.15     data.table_1.14.2
## [65] renv_0.15.4       modelr_0.1.8      vctrs_0.4.1       tzdb_0.3.0       
## [69] urltools_1.7.3    cellranger_1.1.0  gtable_0.3.0      assertthat_0.2.1 
## [73] xfun_0.30         broom_0.8.0       ellipsis_0.3.2




  1. I would like to thank Max Lauber for pointing out that I wrongly stated that both works were written by Jane Austen in an earlier version of this tutorial.