This tutorial shows how to download and clean works from the Project Gutenberg archive using R. Project Gutenberg is a database containing roughly 60,000 texts whose US copyright has expired. The entire R-markdown document for the sections below can be downloaded here.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install several packages so that the scripts shown below run without errors. If you have already installed these packages, you can skip this section. Otherwise, run the following code; installing all packages may take between 1 and 5 minutes, so there is no need to worry if it takes some time.
# install libraries
install.packages("tidyverse")
install.packages("gutenbergr")
install.packages("DT")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
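If some of these packages are already installed, there is no need to install them again. The following is a minimal sketch, not part of the original tutorial, that only checks which packages are missing. The name `missing_packages` is a hypothetical helper, and the install call is left commented out so that nothing is installed unintentionally.

```r
# Packages used in this tutorial (taken from the install code above)
needed <- c("tidyverse", "gutenbergr", "DT", "flextable", "remotes")

# Return the subset of `pkgs` that cannot be loaded, i.e. is not installed
# (`missing_packages` is a hypothetical helper name, not part of any package)
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```

klippy is not on CRAN, so it would still need to be installed from GitHub with the `remotes::install_github` call shown above.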
Now that we have installed the packages, we activate them as shown below.
# activate packages
library(tidyverse)
library(gutenbergr)
library(DT)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.
In a first step, we inspect which works are available for download.
We can do this by typing gutenberg_metadata into the console, which outputs a table containing the metadata for all available texts (the gutenberg_works function returns a filtered version of this table).
gutenberg_metadata
The table below shows the first 15 rows of the overview table. As the table currently lists 51,997 texts, we limit the output here to 15 rows.
To find all works by a specific author, specify the author in the gutenberg_works function as shown below.
# load data
darwin <- gutenberg_works(author == "Darwin, Charles")
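The exact match above only works if you know how the author is spelled in the metadata ("Darwin, Charles"). As a hedged sketch, assuming the stringr package (part of the tidyverse) is installed, a partial match with `str_detect()` can be passed to `gutenberg_works` instead. The object names `darwin_exact` and `darwin_any` are illustrative only.

```r
library(gutenbergr)
library(stringr)

# exact match on the author string, as in the tutorial
darwin_exact <- gutenberg_works(author == "Darwin, Charles")

# partial match: every author whose name contains "Darwin"
# (this may also return works by other members of the Darwin family)
darwin_any <- gutenberg_works(str_detect(author, "Darwin"))

# compare how many works each query returns
nrow(darwin_exact)
nrow(darwin_any)
```

Both queries only search the metadata bundled with gutenbergr, so they run without downloading any texts.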
To find all texts in, for example, German, specify the language in the gutenberg_works function as shown below.
# count the German-language works
gutenberg_works(languages = "de", all_languages = TRUE) %>%
  dplyr::count(language)
## # A tibble: 1 × 2
## language n
## <chr> <int>
## 1 de 1342
To download any of these texts, you need to specify the text you want, e.g. by its title. You can then use the gutenberg_download function to download the text. To exemplify how this works, we download William Shakespeare’s Romeo and Juliet.
# find Romeo and Juliet by its title and download it
romeo <- gutenberg_works(title == "Romeo and Juliet") %>%
  gutenberg_download(meta_fields = "title")
gutenberg_id | text | title |
1,513 | THE TRAGEDY OF ROMEO AND JULIET | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | by William Shakespeare | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | Contents | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | THE PROLOGUE. | Romeo and Juliet |
1,513 | | Romeo and Juliet |
1,513 | ACT I | Romeo and Juliet |
1,513 | Scene I. A public place. | Romeo and Juliet |
1,513 | Scene II. A Street. | Romeo and Juliet |
1,513 | Scene III. Room in Capulet’s House. | Romeo and Juliet |
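As the table shows, the downloaded text contains many empty lines. The following is a minimal cleaning sketch in base R, not the tutorial's own method. So that it runs without a download, it uses a small hand-built data frame (`romeo_head`, a hypothetical name) that copies the first rows of the table above; the same steps could be applied to the full `romeo` object.

```r
# Toy stand-in for the first rows of `romeo` (values copied from the table above)
romeo_head <- data.frame(
  gutenberg_id = 1513,
  text = c("THE TRAGEDY OF ROMEO AND JULIET", "", "", "",
           "by William Shakespeare", "", "", "Contents", "",
           "THE PROLOGUE.", "", "ACT I"),
  title = "Romeo and Juliet",
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace, then keep only rows that still contain text
romeo_head$text <- trimws(romeo_head$text)
romeo_clean <- romeo_head[nzchar(romeo_head$text), ]

# Join the remaining lines into a single string, separated by newlines
romeo_string <- paste(romeo_clean$text, collapse = "\n")

# Number of non-empty lines that remain
nrow(romeo_clean)
```

Whether empty lines should be dropped depends on the analysis: they can mark paragraph or scene boundaries, so it may be worth keeping the uncleaned object as well.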
We could also use the gutenberg_id to download this text.
# find Romeo and Juliet by its gutenberg_id and download it
romeo <- gutenberg_works(gutenberg_id == 1513) %>%
  gutenberg_download(meta_fields = "gutenberg_id")
gutenberg_id | text |
1,513 | THE TRAGEDY OF ROMEO AND JULIET |
1,513 | |
1,513 | |
1,513 | |
1,513 | by William Shakespeare |
1,513 | |
1,513 | |
1,513 | Contents |
1,513 | |
1,513 | THE PROLOGUE. |
1,513 | |
1,513 | ACT I |
1,513 | Scene I. A public place. |
1,513 | Scene II. A Street. |
1,513 | Scene III. Room in Capulet’s House. |
To load more than one text, you can pass a vector of IDs, created with c(), to the gutenberg_download function. Here, we download the text with the gutenberg_id 768 (Wuthering Heights) and the text with the gutenberg_id 1260 (Jane Eyre); the former is by Emily and the latter by Charlotte Brontë.1
texts <- gutenberg_download(c(768, 1260), meta_fields = "title",
                            mirror = "http://mirrors.xmission.com/gutenberg/")
## Text NumberOfLines
## 1 Wuthering Heights 12342
## 2 Jane Eyre 21001
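The code that produced the line counts above is not shown. One possible way to build such a summary in base R is sketched below; it is an assumption, not necessarily how the table was created. The data frame `texts_demo` is a hypothetical stand-in for `texts` with only a few rows per title, so the sketch runs without a download.

```r
# Toy stand-in for `texts`: a few rows per title instead of the full novels
texts_demo <- data.frame(
  gutenberg_id = c(768, 768, 768, 1260, 1260),
  text = c("line a", "line b", "line c", "line d", "line e"),
  title = c(rep("Wuthering Heights", 3), rep("Jane Eyre", 2)),
  stringsAsFactors = FALSE
)

# Count rows (one row per line of text) for each title
line_counts <- as.data.frame(table(texts_demo$title), stringsAsFactors = FALSE)
names(line_counts) <- c("Text", "NumberOfLines")

line_counts
```

Applied to the real `texts` object, the same two steps would count every downloaded line, including empty ones. Note that `table()` orders the titles alphabetically rather than by download order.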
Feel free to have a look at other texts provided by Project Gutenberg!
Schweinberger, Martin. 2022. Downloading Texts from Project Gutenberg using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/gutenberg.html (Version 2022.10.28).
@manual{schweinberger2022gb,
author = {Schweinberger, Martin},
title = {Downloading Texts from Project Gutenberg using R},
note = {https://ladal.edu.au/gutenberg.html},
year = {2022},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2022.10.28}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] flextable_0.8.2 DT_0.26 gutenbergr_0.2.1 forcats_0.5.2
## [5] stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4 readr_2.1.2
## [9] tidyr_1.2.0 tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.2 lubridate_1.8.0 bit64_4.0.5
## [4] httr_1.4.4 tools_4.2.1 backports_1.4.1
## [7] bslib_0.4.0 utf8_1.2.2 R6_2.5.1
## [10] DBI_1.1.3 lazyeval_0.2.2 colorspace_2.0-3
## [13] withr_2.5.0 tidyselect_1.1.2 bit_4.0.4
## [16] curl_4.3.2 compiler_4.2.1 cli_3.3.0
## [19] rvest_1.0.3 xml2_1.3.3 officer_0.4.4
## [22] triebeard_0.3.0 sass_0.4.2 scales_1.2.1
## [25] systemfonts_1.0.4 digest_0.6.29 rmarkdown_2.16
## [28] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.3
## [31] dbplyr_2.2.1 fastmap_1.1.0 highr_0.9
## [34] htmlwidgets_1.5.4 rlang_1.0.4 readxl_1.4.1
## [37] rstudioapi_0.14 jquerylib_0.1.4 generics_0.1.3
## [40] jsonlite_1.8.0 crosstalk_1.2.0 vroom_1.5.7
## [43] zip_2.2.0 googlesheets4_1.0.1 magrittr_2.0.3
## [46] Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3
## [49] gdtools_0.2.4 lifecycle_1.0.1 stringi_1.7.8
## [52] yaml_2.3.5 grid_4.2.1 parallel_4.2.1
## [55] crayon_1.5.1 haven_2.5.1 hms_1.1.2
## [58] knitr_1.40 klippy_0.0.0.9500 pillar_1.8.1
## [61] uuid_1.1-0 reprex_2.0.2 glue_1.6.2
## [64] evaluate_0.16 data.table_1.14.2 modelr_0.1.9
## [67] vctrs_0.4.1 tzdb_0.3.0 urltools_1.7.3
## [70] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
## [73] cachem_1.0.6 xfun_0.32 broom_1.0.0
## [76] googledrive_2.0.0 gargle_1.2.0 ellipsis_0.3.2
I would like to thank Max Lauber for pointing out that I wrongly stated that both works were written by Jane Austen in an earlier version of this tutorial.