Downloading Texts from Project Gutenberg using R
Introduction
This tutorial shows how to download and clean works from the Project Gutenberg archive using R. Project Gutenberg is a database which contains roughly 60,000 texts for which the US copyright has expired. The entire R Markdown document for the sections below can be downloaded here.
Preparation and session set up
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install several packages so that the scripts shown below run without errors. If you have already installed these packages, you can skip this step. Otherwise, run the following code; installing all of the packages may take between 1 and 5 minutes, so there is no need to worry if it takes some time.
# install libraries
install.packages("tidyverse")
install.packages("gutenbergr")
install.packages("DT")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
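As an optional variation on the installation step, the sketch below first checks which of the required packages are missing, so that only those need to be installed. The helper name missing_packages is hypothetical and not part of any package; the actual install call is left commented out so the check can be run safely on its own.

```r
# Packages used in this tutorial (klippy is installed from GitHub separately)
pkgs <- c("tidyverse", "gutenbergr", "DT", "flextable", "remotes")

# Hypothetical helper: return the packages from `pkgs` that are not installed.
# requireNamespace() returns FALSE if a package cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```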
Now that we have installed the packages, we activate them as shown below.
# activate packages
library(tidyverse)
library(gutenbergr)
library(DT)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.
Project Gutenberg
In a first step, we inspect which works are available for download. We can do this by typing gutenberg_works() (which by default lists English-language works with downloadable text) or simply gutenberg_metadata (the full metadata table) into the console, which will output a table of the available texts.
gutenberg_metadata
The table below shows the first 15 rows of this overview table. As there are currently 51,997 texts available, we limit the output here to 15.
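Because the full metadata table is large, it can help to narrow it down before printing. The following is a minimal sketch, assuming the columns has_text, language, gutenberg_id, title, and author that ship with gutenberg_metadata; the object name overview is chosen here for illustration only.

```r
library(dplyr)
library(gutenbergr)

# Keep only English-language records that have a downloadable text,
# then show the identifying columns for the first 15 rows.
overview <- gutenberg_metadata %>%
  filter(has_text, language == "en") %>%
  select(gutenberg_id, title, author) %>%
  head(15)

overview
```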
To find all works by a specific author, you need to specify the author in the gutenberg_works function as shown below.
# load data
darwin <- gutenberg_works(author == "Darwin, Charles")
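The exact match above requires the author string to be spelled exactly as it is stored in the metadata. If the spelling is uncertain, a partial match may be more convenient. The sketch below assumes that a str_detect() condition can be passed to gutenberg_works in the same way as the equality test; the object name darwin_any is illustrative. Note that a partial match can also return other authors whose name contains "Darwin".

```r
library(dplyr)
library(stringr)
library(gutenbergr)

# Match any author entry containing "Darwin" rather than one exact string
darwin_any <- gutenberg_works(str_detect(author, "Darwin"))

# Check which author entries were matched and how many works each has
darwin_any %>%
  count(author, sort = TRUE)
```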
To find all texts in, for example, German, you need to specify the language in the gutenberg_works function as shown below.
# load data
gutenberg_works(languages = "de", all_languages = TRUE) %>%
  dplyr::count(language)
# A tibble: 1 × 2
language n
<chr> <int>
1 de 1296
Loading individual texts
To download any of these texts, you need to specify the text you want, e.g. by specifying the title. In a next step, you can then use the gutenberg_download function to download the text. To exemplify how this works, we download William Shakespeare’s Romeo and Juliet.
# load data
romeo <- gutenberg_works(title == "Romeo and Juliet") %>%
  gutenberg_download(meta_fields = "title")
The resulting table has the columns gutenberg_id, text, and title.
We could also use the gutenberg_id to download this text.
# load data
romeo <- gutenberg_works(gutenberg_id == 1513) %>%
  gutenberg_download(meta_fields = "gutenberg_id")
The resulting table has the columns gutenberg_id and text.
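A downloaded text is stored with one row per line of the original file, including empty lines, so some cleaning is usually needed before analysis. The sketch below shows one possible approach on a small toy table; the object raw and its contents are hypothetical stand-ins for a downloaded text, not real output.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for a downloaded text (hypothetical data, not real output):
# one row per line, including empty lines and stray whitespace.
raw <- tibble(
  gutenberg_id = 1513L,
  text = c(
    "THE TRAGEDY OF ROMEO AND JULIET",
    "",
    "  by William Shakespeare ",
    "",
    "ACT I"
  )
)

clean <- raw %>%
  mutate(text = str_squish(text)) %>% # trim and collapse whitespace
  filter(text != "")                  # drop empty lines

# Collapse the remaining lines into a single character string
full_text <- str_c(clean$text, collapse = " ")
full_text
```

The same two steps could be applied to the romeo object, since it also stores its lines in a column called text.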
Loading texts simultaneously
To load more than one text, you can pass a vector of gutenberg_id values to gutenberg_download. The code below downloads the text with the gutenberg_id 768 (Wuthering Heights) and the text with the gutenberg_id 1260 (Jane Eyre); the former is by Emily Brontë and the latter by Charlotte Brontë.[1]
texts <- gutenberg_download(c(768, 1260),
  meta_fields = "title",
  mirror = "http://mirrors.xmission.com/gutenberg/"
)
Text NumberOfLines
1 Wuthering Heights 12342
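A summary of this kind, with one row per text and its number of lines, can be produced by grouping on the title column. The sketch below illustrates the idea on a toy table; toy_texts and its contents are hypothetical stand-ins for the downloaded texts object, and this is not necessarily how the table above was generated.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the result of gutenberg_download() with
# meta_fields = "title" (hypothetical data, not real output)
toy_texts <- tibble(
  gutenberg_id = c(768L, 768L, 768L, 1260L, 1260L),
  text = c("line a", "line b", "line c", "line d", "line e"),
  title = c(rep("Wuthering Heights", 3), rep("Jane Eyre", 2))
)

# Count the number of lines (rows) per text
line_counts <- toy_texts %>%
  group_by(title) %>%
  summarise(NumberOfLines = n(), .groups = "drop") %>%
  rename(Text = title)

line_counts
```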
Feel free to have a look at the other texts provided by Project Gutenberg!
Citation & Session Info
Schweinberger, Martin. 2025. Downloading Texts from Project Gutenberg using R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/gutenberg/gutenberg.html (Version 2025.04.01).
@manual{schweinberger2025gb,
author = {Schweinberger, Martin},
title = {Downloading Texts from Project Gutenberg using R},
note = {https://ladal.edu.au/tutorials/gutenberg/gutenberg.html},
year = {2025},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2025.04.01}
}
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS 15.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
time zone: Australia/Sydney
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] flextable_0.9.7 DT_0.33 gutenbergr_0.2.4 lubridate_1.9.4
[5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.50 bslib_0.8.0
[4] htmlwidgets_1.6.4 tzdb_0.5.0 vctrs_0.6.5
[7] tools_4.4.0 crosstalk_1.2.1 generics_0.1.3
[10] curl_6.2.2 parallel_4.4.0 klippy_0.0.0.9500
[13] pkgconfig_2.0.3 data.table_1.16.4 assertthat_0.2.1
[16] uuid_1.2-1 lifecycle_1.0.4 compiler_4.4.0
[19] textshaping_1.0.0 munsell_0.5.1 codetools_0.2-20
[22] fontquiver_0.2.1 fontLiberation_0.1.0 htmltools_0.5.8.1
[25] sass_0.4.9 lazyeval_0.2.2 yaml_2.3.10
[28] crayon_1.5.3 pillar_1.10.1 jquerylib_0.1.4
[31] openssl_2.3.1 cachem_1.1.0 fontBitstreamVera_0.1.1
[34] tidyselect_1.2.1 zip_2.3.1 digest_0.6.37
[37] stringi_1.8.4 fastmap_1.2.0 grid_4.4.0
[40] colorspace_2.1-1 cli_3.6.4 magrittr_2.0.3
[43] triebeard_0.4.1 utf8_1.2.4 withr_3.0.2
[46] gdtools_0.4.1 scales_1.3.0 bit64_4.6.0-1
[49] timechange_0.3.0 rmarkdown_2.29 officer_0.6.7
[52] bit_4.5.0.1 askpass_1.2.1 ragg_1.3.3
[55] hms_1.1.3 evaluate_1.0.3 knitr_1.49
[58] urltools_1.7.3 rlang_1.1.5 Rcpp_1.0.14
[61] glue_1.8.0 xml2_1.3.6 renv_1.1.4
[64] vroom_1.6.5 jsonlite_1.8.9 R6_2.5.1
[67] systemfonts_1.2.1
Footnotes
[1] I would like to thank Max Lauber for pointing out that I wrongly stated that both works were written by Jane Austen in an earlier version of this tutorial.