This tutorial shows how to extract text from one or more pdf-files using optical character recognition (OCR) and then save the text(s) in txt-files on your computer.
This tutorial is aimed at beginners and intermediate users of R and showcases how to convert pdfs into txt files using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods for extracting text from pdfs.
This tutorial uses two packages for OCR and text extraction: the pdftools package, which is very fast and highly recommendable when dealing with legible and clean pdf-files (such as pdf-files of websites and books that were rendered directly from, e.g., word-documents), and the tesseract package, which is slower but works much better when the data is unclean and represents, e.g., scans of books, faxes, or reports. In addition, we show how we can combine OCR with spell-checking via the hunspell package (see here for more information) when using the tesseract package (but this can also be done for any other textual data in R).
The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.
Click this link to open an interactive version of this tutorial on MyBinder.org.
This interactive Jupyter notebook allows you to execute code yourself and you can also change and edit the notebook, e.g. you can change code and upload your own data.
Preparation and session set up
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. If you have already installed the packages mentioned below, you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries), so you do not need to worry if it takes a while.
# set options
options(stringsAsFactors = FALSE) # no automatic data transformation
options("scipen" = 100, "digits" = 12) # suppress scientific notation
# install packages
install.packages("pdftools")
install.packages("tesseract")
install.packages("tidyverse")
install.packages("here")
install.packages("hunspell")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
Next we activate the packages.
# activate packages
library(pdftools)
library(tesseract)
library(tidyverse)
library(here)
library(hunspell)
# set tesseract engine
eng <- tesseract("eng")
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed RStudio and have also initiated the session by executing the code shown above, you are good to go.
How to use the RNotebook for this tutorial
To follow this tutorial interactively (by using the RNotebook - or Rmd for short), follow the instructions listed below.
Data and folder set up
R and RStudio set up
Click on File in the upper left corner of the RStudio interface, then on New Project.... Select Existing Directory, navigate to the folder in which you have stored the tutorial data, and click Open. In the Files tab above the lower right panel, open pdf2txt.Rmd. To render the notebook, click on Knit above the upper left panel in RStudio.

The pdf we will convert is a pdf of the Wikipedia article about corpus linguistics. The first part of that pdf is shown below.
Given that the pdf contains tables, urls, references, etc., the text that we extract from it will be rather messy. Cleaning the content of the text would be a separate matter (data processing rather than extraction), so we will only focus on the conversion process here and not on the data cleaning and processing aspect.
We begin the extraction by defining a path to the pdf. Once we have defined a path, i.e. where R is supposed to look for that file, we continue by extracting the text from the pdf.
# you can use a URL or a path that leads to a pdf document
pdf_path <- "https://slcladal.github.io/data/PDFs/pdf0.pdf"
# extract text
txt_output <- pdftools::pdf_text(pdf_path) %>%
  paste0(collapse = " ") %>%
  stringr::str_squish()
Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context ("realia"), and with minimal experimental-interference. The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,[1] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.[2] The text-corpus method is a digestive approach that derives a set of abstract rules that govern a natural language from texts in that language, and explores how that language relates to other languages. Originally derived manually, cor
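As a side note, pdftools::pdf_text returns a character vector with one element per page, which is why we collapse the pages into a single string above. A minimal sketch for inspecting the per-page output before collapsing (the object name pages is ours):

# pdf_text returns one string per page
pages <- pdftools::pdf_text(pdf_path)
# number of pages in the pdf
length(pages)
# first 200 characters of the first page
substr(pages[1], 1, 200)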
To convert many pdf-files, we write a function that performs the conversion for multiple documents.
convertpdf2txt <- function(dirpath){
  # list the pdf-files in the directory
  files <- list.files(dirpath, full.names = TRUE)
  # extract, collapse, and clean the text of each pdf
  sapply(files, function(file){
    pdftools::pdf_text(file) %>%
      paste0(collapse = " ") %>%
      stringr::str_squish()
  })
}
We can now apply the function to the folder in which we have stored the pdf-files we want to convert. In the present case, I have stored 4 pdf-files of Wikipedia articles in a folder called PDFs which is located in my data folder (as described in the section above that detailed how to set up the Rproject folder on your computer). The output is a vector with the texts of the pdf-files.
# apply function
txts <- convertpdf2txt(here::here("data", "PDFs/"))
Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context ("realia"), and with minimal experimental-interference. The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,[1] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.[2] The text-corpus method is a digestive approach that derives a set of abstract rules that govern a natural language from texts in that language, and explores how that language relates to other languages. Originally derived manually, cor
Language - Wikipedia https://en.wikipedia.org/wiki/Language Language A language is a structured system of communication. Language, in a broader sense, is the method of communication that involves the use of – particularly human – languages.[1][2][3] The scientific study of language is called linguistics. Questions concerning the philosophy of language, such as whether words can represent experience, have been debated at least since Gorgias and Plato in ancient Greece. Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have held that it originated from rational and logical thought. 20th-century philosophers such as Wittgenstein argued that philosophy is really the study of language. Major figures in linguistics include Ferdinand de Saussure and Noam Chomsky. Estimates of the number of human languages in the world vary between 5,000 and 7,000. However, any precise estimate depends on the arbitrary distinction (dichotomy) between languages
Natural language processing - Wikipedia https://en.wikipedia.org/wiki/Natural_language_processing Natural language processing Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. Contents History Rule-based vs. statistical NLP Major evaluations and tasks Syntax Semantics An automated online assistant Discourse providing customer service on a Speech web page, an example of an Dialogue application where natural Cognition language processing is a major component.[1] See also References Further reading History The history of natural language processing (NLP) generally started in the 195
Computational linguistics - Wikipedia https://en.wikipedia.org/wiki/Computational_linguistics Computational linguistics Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions. Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language. Today, computational linguists often work as members of interdisciplinary teams, which can include regular linguists, experts in the target language, and computer scientists. In general, computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, mathematicians, logicians, philosophers, cognitive scientists, cognitive psychologists, psycholinguists, anthropologists and neuroscientists, among
The table above shows the first 1000 characters of the texts extracted from 4 pdf-files of Wikipedia articles associated with language technology (corpus linguistics, language, natural language processing, and computational linguistics).
To save the texts in txt-files on your disc, you can simply replace the predefined location (the data folder of your Rproject, located via the string here::here("data")) with the folder where you want to store the txt-files and then execute the code below. Also, we will name the texts (or the txt-files if you like) pdftext plus their index number.
# add names to txt files
names(txts) <- paste0(here::here("data", "pdftext"), seq_along(txts))
# save result to disc
lapply(seq_along(txts), function(i) writeLines(text = unlist(txts[i]),
                                               con = paste0(names(txts)[i], ".txt")))
If you check the data folder in your Rproject folder, you should find 4 files called pdftext1.txt, pdftext2.txt, pdftext3.txt, and pdftext4.txt.
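If you want to verify the export from within R, you can read one of the saved files back in. A minimal sketch, assuming the files were saved to the data folder as above:

# read the first exported txt-file back into R
pdftext1 <- readLines(here::here("data", "pdftext1.txt"))
# inspect the beginning of the text
substr(pdftext1, 1, 200)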
OCR with tesseract
In this section, we use the tesseract package for OCR (see here for more information and a more thorough tutorial on using the tesseract package). The tesseract package provides R bindings to Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.
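We use the English engine defined above, but tesseract supports many more languages. As a sketch (the German language code "deu" below is just an example), you can inspect and extend the available languages as follows:

# show the tesseract configuration and the available languages
tesseract::tesseract_info()
# download additional trained data, e.g. for German ("deu")
# tesseract::tesseract_download("deu")
# deu <- tesseract::tesseract("deu")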
We start by creating a vector of paths to the pdf-files that we want to extract the text from.
# list the pdf-files in the data folder
fls <- list.files(here::here("data/PDFs"), full.names = TRUE)
# perform ocr on each file
ocrs <- sapply(fls, function(x){
  # perform ocr and collapse the pages into a single string
  tesseract::ocr(x, engine = eng) %>%
    paste0(collapse = " ")
})
## Converting page 1 to pdf0_1.png... done!
## Converting page 2 to pdf0_2.png... done!
## Converting page 1 to pdf1_1.png... done!
## Converting page 2 to pdf1_2.png... done!
## Converting page 3 to pdf1_3.png... done!
## Converting page 4 to pdf1_4.png... done!
## Converting page 5 to pdf1_5.png... done!
## Converting page 6 to pdf1_6.png... done!
## Converting page 7 to pdf1_7.png... done!
## Converting page 8 to pdf1_8.png... done!
## Converting page 9 to pdf1_9.png... done!
## Converting page 10 to pdf1_10.png... done!
## Converting page 11 to pdf1_11.png... done!
## Converting page 1 to pdf2_1.png... done!
## Converting page 2 to pdf2_2.png... done!
## Converting page 3 to pdf2_3.png... done!
## Converting page 4 to pdf2_4.png... done!
## Converting page 1 to pdf3_1.png... done!
## Converting page 2 to pdf3_2.png... done!
## Converting page 3 to pdf3_3.png... done!
## Converting page 4 to pdf3_4.png... done!
Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics
Language - Wikipedia https://en.wikipedia.org/wiki/Language
Natural language processing - Wikipedia https://en.wikipedia.org/wiki/Natural_ language processing
Computational linguistics - Wikipedia https://en.wikipedia.org/wiki/Computational_ linguistics
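To gauge OCR quality before any spell-checking, the tesseract package also offers ocr_data, which returns word-level results with confidence scores. A minimal sketch for the first pdf (the object name ocr_df is ours):

# word-level ocr output with confidence scores
ocr_df <- tesseract::ocr_data(fls[1], engine = eng)
# inspect the words with the lowest confidence
head(ocr_df[order(ocr_df$confidence), ])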
Although the results already look very promising, we want to see how we can combine automated spell-checking/correction with OCR as this is necessary when dealing with less pristine documents.
In a first step, we write a function that loops over each text and checks which words occur in an English language dictionary (which we do not specify as it is the default). This spell-checking makes use of the hunspell package (see here for more information). Hunspell is based on MySpell and is backward-compatible with MySpell and aspell dictionaries. This means that we can import and make use of many different language dictionaries, and it is quite likely that dictionaries for other languages are already available on your system!
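As a sketch, you can check which dictionaries hunspell finds on your system and select one explicitly via the dict argument (the British English code "en_GB" below is just an example and may need to be installed separately):

# list the dictionaries hunspell can find on this system
hunspell::list_dictionaries()
# use a specific dictionary for spell-checking, e.g. British English
# hunspell::hunspell_check(c("colour", "color"), dict = hunspell::dictionary("en_GB"))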
# create token list
tokens_ocr <- sapply(ocrs, function(x){
  hunspell::hunspell_parse(x)
})
c("Corpus", "linguistics", "Wikipedia", "https", "en", "wikipedia", "org", "wiki", "Corpus", "linguistics", "WIKIPEDIA", "e", "e", "e", "Corpus", "linguistics", "Corpus", "linguistics", "is", "the", "study", "of", "language", "as", "expressed", "in", "corpora", "samples", "of", "real", "world", "text", "Corpus", "linguistics", "proposes", "that", "reliable", "language", "analysis", "is", "more", "feasible", "with", "corpora", "collected", "in", "the", "field", "in", "its", "natural", "context", "realia",
c("Language", "Wikipedia", "https", "en", "wikipedia", "org", "wiki", "Language", "WIKIPEDIA", "Language", "A", "language", "is", "a", "structured", "system", "of", "communication", "Language", "in", "a", "broader", "sense", "is", "the", "method", "of", "communication", "that", "involves", "the", "use", "of", "particularly", "human", "languages", "J", "II", "The", "scientific", "study", "of", "language", "is", "called", "linguistics", "Questions", "concerning", "the", "philosophy", "of", "language",
c("Natural", "language", "processing", "Wikipedia", "https", "en", "wikipedia", "org", "wiki", "Natural", "language", "processing", "WIKIPEDIA", "e", "Natural", "language", "processing", "Natural", "language", "processing", "NLP", "is", "a", "subfield", "of", "linguistics", "computer", "science", "information", "engineering", "and", "artificial", "intelligence", "concerned", "with", "the", "interactions", "ass", "between", "computers", "and", "human", "natural", "languages", "in", "particular", "how",
c("Computational", "linguistics", "Wikipedia", "https", "en", "wikipedia", "org", "wiki", "Computational", "linguistics", "WIKIPEDIA", "e", "e", "e", "e", "Computational", "linguistics", "Computational", "linguistics", "is", "an", "interdisciplinary", "field", "concerned", "with", "the", "statistical", "or", "rule", "based", "modeling", "of", "natural", "language", "from", "a", "computational", "perspective", "as", "well", "as", "the", "study", "of", "appropriate", "computational", "approaches",
In a next step, we correct the errors resulting from the OCR process and paste the texts back together (which is all done by the code chunk below).
# clean
clean_ocrtext <- sapply(tokens_ocr, function(x){
  # check which tokens are spelled correctly
  correct <- hunspell::hunspell_check(x)
  # replace misspelled tokens with (recycled) correctly spelled tokens
  x <- ifelse(!correct, x[correct], x)
  # paste the tokens back together
  paste0(x, collapse = " ")
})
Corpus linguistics Wikipedia en en wiki org wiki Corpus linguistics WIKIPEDIA e e e Corpus linguistics Corpus linguistics is the study of language as expressed in corpora samples of real world text Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context minimal and with minimal experimental interference The field of corpus linguistics features divergent views about the value of corpus annotation These views range from John minimal Sinclair who advocates minimal annotation so texts speak for themselves to the Survey of English Usage team University College London who advocate annotation as allowing greater linguistic understanding through rigorous recording The text corpus method is a digestive approach that derives a set of abstract rules that govern a natural language from texts in that language and explores how that language relates to other languages Originally derived manually corpora now are automatica
Language Wikipedia en en wiki org wiki Language WIKIPEDIA Language A language is a structured system of communication Language in a broader sense is the method of communication that involves the use of particularly human languages J II The scientific study of language is called linguistics Questions concerning the philosophy of language such as whether words can represent experience have been o debated at least since in and Plato in ancient Greece Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have held that it originated from rational and logical thought as century philosophers such as Wittgenstein argued that philosophy is really the study of language Major include An figures in linguistics include Ferdinand of Saussure and of Chomsky Lie Estimates of the number of human languages in the world vary between and However any precise estimate depends on the arbitrary distinction Natural Y is dichotomy between languages and dialect Natu
Natural language processing Wikipedia en en wiki org wiki Natural language processing WIKIPEDIA e Natural language processing Natural language processing of is a science of linguistics computer science information engineering and artificial intelligence concerned with the interactions ass between computers and human natural languages in particular how to program computers to process and analyze large amounts of natural language data processing frequently involve speech Challenges in natural language processing frequently involve speech recognition natural language understanding and natural language generation non Contents am History i online Rule based vs statistical How may Hi I'm your automated online online assistant Major evaluations and tasks service How may page an f of Syntax Dialogue Semantics where An automated online assistant Discourse a providing customer service also Speech web page an example of an Dialogue application where natural Cognition language processing is a majo
Computational linguistics Wikipedia en en wiki org wiki Computational linguistics WIKIPEDIA e e e e Computational linguistics Computational linguistics is an interdisciplinary field concerned with the statistical or rule based modeling of natural language from a computational perspective as well as the study of appropriate computational approaches to linguistic questions Traditionally computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language Today computational linguists often work as members of interdisciplinary teams which can include regular linguists experts in the target language and computer scientists In general computational linguistics draws upon the involvement of linguists computer scientists experts in artificial intelligence mathematicians logicians philosophers cognitive scientists cognitive psychologists among anthropologists and linguistics among others Computational linguis
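Note that the replacement strategy above is rather crude, as misspelled tokens are simply overwritten with recycled correctly spelled tokens, which explains oddities such as repeated words in the output. A more principled alternative would be to replace each flagged token with hunspell's first suggestion. A minimal sketch, assuming the token lists created above (the function name correct_tokens is ours):

correct_tokens <- function(tokens){
  # flag tokens that are not in the dictionary
  flagged <- !hunspell::hunspell_check(tokens)
  # retrieve suggested corrections for the flagged tokens
  suggestions <- hunspell::hunspell_suggest(tokens[flagged])
  # use the first suggestion where one exists, otherwise keep the token
  tokens[flagged] <- mapply(function(token, suggestion){
    if (length(suggestion) > 0) suggestion[1] else token
  }, tokens[flagged], suggestions)
  # paste the corrected tokens back together
  paste0(tokens, collapse = " ")
}
# apply the alternative correction to all texts
# clean_ocrtext2 <- sapply(tokens_ocr, correct_tokens)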
We have reached the end of this tutorial and we hope that it helps you in performing OCR on your own pdfs.
Schweinberger, Martin. 2023. Converting PDFs to txt files with R. Brisbane: The University of Queensland. url: https://ladal.edu.au/pdf2txt.html (Version 2023.02.09).
@manual{schweinberger2023pdf2txt,
author = {Schweinberger, Martin},
title = {Converting PDFs to txt files with R},
note = {https://ladal.edu.au/pdf2txt.html},
year = {2023},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2023.02.09}
}
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] hunspell_3.0.1 here_1.0.1 forcats_0.5.1 stringr_1.4.0
## [5] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [9] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.2 tesseract_5.1.0
## [13] pdftools_3.3.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.3 sass_0.4.1 jsonlite_1.8.0
## [4] modelr_0.1.8 bslib_0.3.1 assertthat_0.2.1
## [7] askpass_1.1 highr_0.9 renv_0.15.4
## [10] googlesheets4_1.0.0 cellranger_1.1.0 yaml_2.3.5
## [13] qpdf_1.2.0 gdtools_0.2.4 pillar_1.7.0
## [16] backports_1.4.1 glue_1.6.2 uuid_1.1-0
## [19] digest_0.6.29 rvest_1.0.2 colorspace_2.0-3
## [22] htmltools_0.5.2 pkgconfig_2.0.3 broom_1.0.0
## [25] haven_2.5.0 scales_1.2.0 officer_0.4.3
## [28] tzdb_0.3.0 googledrive_2.0.0 generics_0.1.3
## [31] ellipsis_0.3.2 withr_2.5.0 klippy_0.0.0.9500
## [34] cli_3.3.0 magrittr_2.0.3 crayon_1.5.1
## [37] readxl_1.4.0 evaluate_0.15 fs_1.5.2
## [40] fansi_1.0.3 xml2_1.3.3 tools_4.1.2
## [43] data.table_1.14.2 hms_1.1.1 gargle_1.2.0
## [46] lifecycle_1.0.1 flextable_0.7.3 munsell_0.5.0
## [49] reprex_2.0.1 zip_2.2.0 compiler_4.1.2
## [52] jquerylib_0.1.4 systemfonts_1.0.4 rlang_1.0.4
## [55] grid_4.1.2 rappdirs_0.3.3 base64enc_0.1-3
## [58] rmarkdown_2.14 gtable_0.3.0 DBI_1.1.3
## [61] R6_2.5.1 lubridate_1.8.0 knitr_1.39
## [64] fastmap_1.1.0 utf8_1.2.2 rprojroot_2.0.3
## [67] stringi_1.7.8 Rcpp_1.0.8.3 vctrs_0.4.1
## [70] dbplyr_2.2.1 tidyselect_1.1.2 xfun_0.31