This tutorial introduces web crawling and web scraping with R. Web crawling and web scraping are important and common procedures for collecting text data from social media sites, web pages, or other documents for later analysis. Regarding terminology, the automated download of HTML pages is called crawling, while the extraction of textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM, the document object model of the website) is called scraping (see Olston and Najork 2010).
This tutorial is aimed at intermediate and advanced users of R and showcases how to crawl and scrape web data using R. The aim is not to provide a fully-fledged analysis but rather to exemplify selected useful methods for crawling and scraping web data.
The entire R Notebook for the tutorial can be downloaded here.
If you want to render the R Notebook on your machine, i.e. knitting the
document to html or a pdf, you need to make sure that you have R and
RStudio installed and you also need to download the bibliography
file and store it in the same folder where you store the
Rmd file.
This tutorial builds heavily on and uses materials from this
tutorial on web crawling and scraping using R by Andreas Niekler and
Gregor Wiedemann (see Wiedemann and Niekler 2017). The tutorial by
Andreas Niekler and Gregor Wiedemann is more thorough, goes into more
detail than this tutorial, and covers many more very useful text mining
methods. An alternative approach for web crawling and scraping would be
to use the RCrawler
package (Khalil and Fakir
2017) which is not introduced here though (inspecting the
RCrawler
package and its functions is, however, also highly
recommended). For a more in-depth introduction to web crawling and
scraping, Miner et al. (2012) is a very useful resource.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes), so you do not need to worry if it takes a while.
# install packages
install.packages("rvest")
install.packages("readtext")
install.packages("webdriver")
install.packages("tidyverse")
install.packages("flextable")
# install here (used below to construct file paths when saving results)
install.packages("here")
webdriver::install_phantomjs()
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
If not done yet, please install the phantomJS headless browser. This needs to be done only once.
Now that we have installed the packages (and the phantomJS headless browser), we can activate them as shown below.
# load packages
library(tidyverse)
library(rvest)
library(readtext)
library(flextable)
library(webdriver)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed R and RStudio and once you have initiated the session by executing the code shown above, you are good to go.
For web crawling and scraping, we use the rvest package; to extract
text data from various formats such as PDF, DOC, DOCX, and TXT files,
we use the readtext package. In a first exercise,
we will download a single web page from The Guardian and
extract text together with relevant metadata such as the article date.
Let’s define the URL of the article of interest and load the content
using the read_html
function from the rvest
package, which provides very useful functions for web crawling and
scraping.
# define url
url <- "https://www.theguardian.com/world/2017/jun/26/angela-merkel-and-donald-trump-head-for-clash-at-g20-summit"
# download content
webc <- rvest::read_html(url)
# inspect
webc
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#maincontent" class="dcr-1ctuwgl">Skip t ...
We download and parse the webpage using the read_html
function, which accepts a URL as its argument. The function downloads the
page and parses the HTML source code into an HTML/XML object.
However, the output contains a lot of information that we do not really need. Thus, we process the data to extract only the text from the webpage.
webc %>%
# extract paragraphs
rvest::html_nodes("p") %>%
# extract text
rvest::html_text() -> webtxt
# inspect
head(webtxt)
## [1] "German chancellor plans to make climate change, free trade and mass migration key themes in Hamburg, putting her on collision course with US"
## [2] "A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the management of forced mass global migration the key themes of the G20 summit in Hamburg next week."
## [3] "The G20 summit brings together the world’s biggest economies, representing 85% of global gross domestic product (GDP), and Merkel’s chosen agenda looks likely to maximise American isolation while attempting to minimise disunity amongst others. "
## [4] "The meeting, which is set to be the scene of large-scale street protests, will also mark the first meeting between Trump and the Russian president, Vladimir Putin, as world leaders."
## [5] "Trump has already rowed with Europe once over climate change and refugees at the G7 summit in Italy, and now looks set to repeat the experience in Hamburg but on a bigger stage, as India and China join in the criticism of Washington. "
## [6] "Last week, the new UN secretary-general, António Guterres, warned the Trump team if the US disengages from too many issues confronting the international community it will be replaced as world leader."
The output shows the first 6 text elements of the website, which means that we successfully scraped the text content of the web page.
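If you prefer the scraped article as one continuous character string rather than a vector of paragraphs, you can collapse the elements of webtxt. The lines below are a minimal sketch of this optional step (the object name webtxt_full is only introduced here for illustration).
# combine the scraped paragraphs into a single text string
webtxt_full <- paste0(webtxt, collapse = " ")
# inspect the beginning of the combined text
substr(webtxt_full, 1, 200)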
We can also extract the headline of the article by running the code shown below.
webc %>%
# extract the headline (h1 element)
rvest::html_nodes("h1") %>%
# extract text
rvest::html_text() -> header
# inspect
head(header)
## [1] "Angela Merkel and Donald Trump head for clash at G20 summit"
Modern websites often do not contain the full content displayed in
the browser in their corresponding source files which are served by the
web-server. Instead, the browser loads additional content dynamically
via javascript code contained in the original source file. To be able to
scrape such content, we rely on the headless browser phantomJS, which
renders a site for a given URL for us before we start the actual
scraping, i.e. the extraction of certain identifiable elements from the
rendered site.
NOTE
In case the website does not fetch or alter the
to-be-scraped content dynamically, you can omit the PhantomJS webdriver
and just download the static HTML source code to retrieve the
information from there. In this case, replace the following block of
code with a simple call of
html_document <- read_html(url)
where the
read_html()
function downloads the unrendered page source
code directly.
Now we can start an instance of PhantomJS and create a new browser
session that awaits URLs to load and render the corresponding
websites.
pjs_instance <- run_phantomjs()
pjs_session <- Session$new(port = pjs_instance$port)
To make sure that we get the dynamically rendered HTML content of the
website, we pass the original source code downloaded from the URL to our
PhantomJS
session first, and then use the rendered
source.
Usually, we do not want to download a single document, but a series of documents. In our second exercise, we want to download all Guardian articles tagged with Angela Merkel. Instead of a tag page, we could also be interested in downloading results of a site-search engine or any other link collection. The task is always two-fold:
First, we download and parse the tag overview page to extract all links to articles of interest:
url <- "https://www.theguardian.com/world/angela-merkel"
# go to URL
pjs_session$go(url)
# render page
rendered_source <- pjs_session$getSource()
# download text and parse the source code into an XML object
html_document <- read_html(rendered_source)
Second, we download and scrape each individual article page. For this,
we extract all href attributes from a elements fitting a certain CSS
class. To select the right contents via XPath selectors, you need to
investigate the HTML structure of your specific page. Modern browsers
such as Firefox and Chrome support you in this task with a function
called “Inspect Element” (or similar), available through a right-click
on the page element.
links <- html_document %>%
html_nodes(xpath = "//div[contains(@class, 'fc-item__container')]/a") %>%
html_attr(name = "href")
# inspect
links
## [1] "https://www.theguardian.com/books/2022/nov/02/irelands-call-navigating-brexit-by-stephen-collins-review-how-dublin-got-brussels-on-side"
## [2] "https://www.theguardian.com/world/2022/aug/19/world-leaders-dancefloor-videos-sanna-marin"
## [3] "https://www.theguardian.com/world/2022/jun/07/no-regrets-over-handling-of-vladimir-putin-says-angela-merkel"
## [4] "https://www.theguardian.com/world/2022/jun/02/germany-dependence-russian-energy-gas-oil-nord-stream"
## [5] "https://www.theguardian.com/world/2022/may/20/chat-group-leak-reveals-far-right-fantasies-of-germany-afd"
## [6] "https://www.theguardian.com/commentisfree/2022/apr/17/observer-view-french-elections"
## [7] "https://www.theguardian.com/commentisfree/2022/apr/10/germany-role-against-delusional-putin"
## [8] "https://www.theguardian.com/world/2022/mar/05/germany-angela-merkel-power-to-vladimir-putin-russia"
## [9] "https://www.theguardian.com/commentisfree/2021/dec/26/nothing-like-a-dame-trump-tops-2021s-cast-of-clowns-and-baddies"
## [10] "https://www.theguardian.com/world/2021/dec/08/olaf-scholz-elected-succeed-angela-merkel-german-chancellor"
## [11] "https://www.theguardian.com/world/2021/dec/08/after-16-years-at-the-top-of-german-politics-what-now-for-angela-merkel"
## [12] "https://www.theguardian.com/world/2021/dec/08/olaf-scholz-to-be-voted-in-as-german-chancellor-as-merkel-era-ends"
## [13] "https://www.theguardian.com/world/2021/dec/07/new-faces-policies-and-accents-germanys-next-coalition"
## [14] "https://www.theguardian.com/world/2021/dec/02/angela-merkel-bows-out-to-the-sound-of-beethoven-and-an-east-german-punk-hit"
## [15] "https://www.theguardian.com/world/2021/dec/02/germany-could-make-covid-vaccination-mandatory-says-merkel"
## [16] "https://www.theguardian.com/world/2021/dec/02/angela-merkel-to-bow-out-with-ceremony-live-on-german-tv"
## [17] "https://www.theguardian.com/world/2021/dec/01/long-reigns-often-leave-long-shadows-europeans-on-angela-merkel"
## [18] "https://www.theguardian.com/world/2021/nov/30/slugs-angela-merkel-blair-barroso-prodi-on-germany-leader"
## [19] "https://www.theguardian.com/world/2021/nov/29/angela-merkel-punk-pick-for-leaving-ceremony-raises-eyebrows"
## [20] "https://www.theguardian.com/world/2021/nov/28/a-new-german-era-dawns-but-collisions-lie-in-wait-for-coalition"
Now, links contains a character vector of 20 hyperlinks to individual
articles tagged with Angela Merkel.
But stop! There is not only one page of links to tagged articles. If you have a look at the page in your browser, you will see that the tag overview page has more than 60 sub pages, accessible via a paging navigator at the bottom. By clicking on the second page, we see a different URL structure, which now contains a specific page number. We can use that format to create links to all sub pages by combining the base URL with the page numbers.
page_numbers <- 1:3
base_url <- "https://www.theguardian.com/world/angela-merkel?page="
paging_urls <- paste0(base_url, page_numbers)
# inspect
paging_urls
## [1] "https://www.theguardian.com/world/angela-merkel?page=1"
## [2] "https://www.theguardian.com/world/angela-merkel?page=2"
## [3] "https://www.theguardian.com/world/angela-merkel?page=3"
Now we can iterate over all URLs of the tag overview pages to collect more (or all) links to articles tagged with Angela Merkel. We iterate with a for-loop over all URLs and append the results from each URL to a vector of all links.
all_links <- NULL
for (url in paging_urls) {
# download and parse single overview page
pjs_session$go(url)
rendered_source <- pjs_session$getSource()
html_document <- read_html(rendered_source)
# extract links to articles
links <- html_document %>%
html_nodes(xpath = "//div[contains(@class, 'fc-item__container')]/a") %>%
html_attr(name = "href")
# append links to vector of all links
all_links <- c(all_links, links)
}
The first elements of all_links:
https://www.theguardian.com/books/2022/nov/02/irelands-call-navigating-brexit-by-stephen-collins-review-how-dublin-got-brussels-on-side
https://www.theguardian.com/world/2022/aug/19/world-leaders-dancefloor-videos-sanna-marin
https://www.theguardian.com/world/2022/jun/07/no-regrets-over-handling-of-vladimir-putin-says-angela-merkel
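As a side note, instead of growing all_links inside a for-loop, you could also collect the links with lapply and flatten the result afterwards. The following lines are a sketch of this alternative, assuming that pjs_session and paging_urls are defined as above.
# collect the links from all paging URLs without an explicit for-loop
all_links <- unlist(lapply(paging_urls, function(url) {
  # render the overview page in the PhantomJS session
  pjs_session$go(url)
  html_document <- read_html(pjs_session$getSource())
  # extract the links to articles
  html_document %>%
    html_nodes(xpath = "//div[contains(@class, 'fc-item__container')]/a") %>%
    html_attr(name = "href")
}))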
An effective way of programming is to encapsulate repeatedly used code in a specific function. This function can then be called with specific parameters, process something, and return a result. We use this here to encapsulate the downloading and parsing of a Guardian article given a specific URL. The code is the same as in our first exercise above, except that we combine the extracted texts and metadata in a data.frame and wrap the entire process in a function block.
scrape_guardian_article <- function(url) {
# load the URL in the running PhantomJS session
pjs_session$go(url)
rendered_source <- pjs_session$getSource()
# read raw html
html_document <- read_html(rendered_source)
# extract title
title <- html_document %>%
rvest::html_node("h1") %>%
rvest::html_text(trim = T)
# extract text (html_node returns only the first matching paragraph)
text <- html_document %>%
rvest::html_node("p") %>%
rvest::html_text(trim = T)
# extract date
date <- url %>%
stringr::str_replace_all(".*([0-9]{4,4}/[a-z]{3,4}/[0-9]{1,2}).*", "\\1")
# generate data frame from results
article <- data.frame(
url = url,
date = date,
title = title,
body = text
)
return(article)
}
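Before looping over all collected links, it can be useful to test the function on a single URL. A quick check (using the first element of all_links) could look like this:
# test the scraping function on a single URL
test_article <- scrape_guardian_article(all_links[1])
# inspect the structure of the returned data.frame
str(test_article)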
Now we can use the function scrape_guardian_article in any other part of our script. For instance, we can loop over each of our collected links. We use a running variable i, taking values from 1 to length(all_links), to access the individual links in all_links and to write some progress output.
# create container for loop output
all_articles <- data.frame()
# loop over links
for (i in 1:length(all_links)) {
# print progress (optional)
#cat("Downloading", i, "of", length(all_links), "URL:", all_links[i], "\n")
# scrape website
article <- scrape_guardian_article(all_links[i])
# append current article data.frame to the data.frame of all articles
all_articles <- rbind(all_articles, article)
}
url | date | title | body |
https://www.theguardian.com/books/2022/nov/02/irelands-call-navigating-brexit-by-stephen-collins-review-how-dublin-got-brussels-on-side | 2022/nov/02 | Ireland’s Call: Navigating Brexit by Stephen Collins review – how Dublin got Brussels on side | Though a forensic study of the Brexit negotiations from an Irish perspective might sound dry, it is anything but |
https://www.theguardian.com/world/2022/aug/19/world-leaders-dancefloor-videos-sanna-marin | 2022/aug/19 | Dancing in the limelight: when world leaders take to the dancefloor | Sanna Marin follows in a long line of politicians filmed dancing, some more enthusiastically than others |
https://www.theguardian.com/world/2022/jun/07/no-regrets-over-handling-of-vladimir-putin-says-angela-merkel | 2022/jun/07 | No regrets over handling of Vladimir Putin, says Angela Merkel | Former German chancellor claims her opposition to Ukraine’s Nato membership helped country |
If you perform the web scraping on your own machine, you can now save the table generated above using the code below. The code chunk assumes that you have a folder called data in your current working directory (the file path is constructed with the here package).
write.table(all_articles, here::here("data", "all_articles.txt"), sep = "\t")
The last command writes the extracted articles to a tab-separated file in the data directory on your machine for later use.
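To load the saved articles back into R in a later session, you can read the tab-separated file again. The following lines are a sketch, assuming the same folder structure as above.
# read the articles back into R
all_articles <- read.delim(here::here("data", "all_articles.txt"), sep = "\t")
# inspect
str(all_articles)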
EXERCISE TIME!
Try to apply the crawling and scraping procedures shown above to another news website, for example https://www.theaustralian.com.au, https://www.nytimes.com, or https://www.spiegel.de. For this, investigate the URL patterns of the page and look into the source code with the `inspect element' functionality of your browser to find appropriate XPATH expressions.
Schweinberger, Martin. 2022. Web Crawling and Scraping using R. Brisbane: The University of Queensland. URL: https://slcladal.github.io/webcrawling.html (Version 2022.11.15).
@manual{schweinberger2022webc,
author = {Schweinberger, Martin},
title = {Web Crawling and Scraping using R},
note = {https://ladal.edu.au/webcrawling.html},
year = {2022},
organization = "The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2022.11.15}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] webdriver_1.0.6 flextable_0.8.2 readtext_0.81 rvest_1.0.3
## [5] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4
## [9] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8 ggplot2_3.3.6
## [13] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.2 lubridate_1.8.0 showimage_1.0.0
## [4] httr_1.4.4 tools_4.2.1 backports_1.4.1
## [7] bslib_0.4.0 utf8_1.2.2 R6_2.5.1
## [10] DBI_1.1.3 colorspace_2.0-3 withr_2.5.0
## [13] tidyselect_1.1.2 processx_3.7.0 curl_4.3.2
## [16] compiler_4.2.1 cli_3.3.0 xml2_1.3.3
## [19] officer_0.4.4 sass_0.4.2 scales_1.2.1
## [22] callr_3.7.2 systemfonts_1.0.4 digest_0.6.29
## [25] rmarkdown_2.16 katex_1.4.0 base64enc_0.1-3
## [28] pkgconfig_2.0.3 htmltools_0.5.3 dbplyr_2.2.1
## [31] fastmap_1.1.0 highr_0.9 rlang_1.0.4
## [34] readxl_1.4.1 rstudioapi_0.14 jquerylib_0.1.4
## [37] generics_0.1.3 jsonlite_1.8.0 zip_2.2.0
## [40] googlesheets4_1.0.1 magrittr_2.0.3 Rcpp_1.0.9
## [43] munsell_0.5.0 fansi_1.0.3 gdtools_0.2.4
## [46] lifecycle_1.0.1 stringi_1.7.8 yaml_2.3.5
## [49] debugme_1.1.0 grid_4.2.1 crayon_1.5.1
## [52] haven_2.5.1 hms_1.1.2 knitr_1.40
## [55] ps_1.7.1 klippy_0.0.0.9500 pillar_1.8.1
## [58] uuid_1.1-0 reprex_2.0.2 xslt_1.4.3
## [61] glue_1.6.2 evaluate_0.16 V8_4.2.1
## [64] data.table_1.14.2 modelr_0.1.9 vctrs_0.4.1
## [67] png_0.1-7 tzdb_0.3.0 selectr_0.4-2
## [70] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
## [73] cachem_1.0.6 xfun_0.32 broom_1.0.0
## [76] equatags_0.2.0 googledrive_2.0.0 gargle_1.2.0
## [79] ellipsis_0.3.2