The data we work with comes in many formats and types. This tutorial therefore shows how to load and save different types of data when working with R, and it briefly covers how to generate data in R.
This tutorial is aimed at beginners. The aim is not to provide a fully-fledged analysis but rather to show and exemplify how to load and save the most common types of data and data structures in R.
To be able to follow this tutorial, we suggest you check out and
familiarize yourself with the content of the following R
Basics tutorials:
Click here to download the entire R Notebook for this tutorial (see the note at the end of this tutorial on rendering the Notebook).
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. Before turning to the rest of the tutorial, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code. Installing all of the packages may take some time (between 1 and 5 minutes), so you do not need to worry if it takes a while.
# install packages
install.packages("xlsx")
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("flextable")
install.packages("openxlsx")
install.packages("here")
install.packages("faux")
install.packages("data.tree")
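# install klippy for copy-to-clipboard button in code chunks
# (klippy is installed from GitHub, which requires the remotes package)
install.packages("remotes")
remotes::install_github("rlesur/klippy")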
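If some of these packages are already on your machine, you may prefer to install only the missing ones. The following is a minimal sketch of that idea using base R only; the object names pkgs and missing_pkgs are our own choices for illustration, and the actual installation line is commented out so that nothing is installed unless you decide to run it.

```r
# packages used in this tutorial (taken from the install code above)
pkgs <- c("xlsx", "dplyr", "stringr", "tidyr", "flextable",
          "openxlsx", "here", "faux", "data.tree")

# keep only those packages that are not yet installed
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# install only what is missing (uncomment to run)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# inspect which packages would be installed
missing_pkgs
```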
Now that we have installed the packages, we can activate them as shown below.
# load packages
library(dplyr)
library(stringr)
library(tidyr)
library(flextable)
library(xlsx)
library(openxlsx)
library(here)
library(data.tree)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.
NOTE
This tutorial assumes that you will be loading data from your own computer, as is often the case.
This tutorial assumes that you have a designated subfolder named data within the directory where your R project (the Rproj file) is located. It is assumed that your data sets are stored in this data subfolder. Additionally, we provide guidance on how to load multiple text files into R, a common scenario when working with corpora. These texts are expected to be located in a folder named testcorpus, which sits within the data subfolder.
If you have a different setup, you will need to adjust the path to the data in order for the tutorial to function correctly on your own computer. Note that the here function is used to create paths that start from the directory where the Rproj file is located.
In other words, your directory should have the structure shown below.
                       levelName
1  myproject
2   ¦--Rproj
3   ¦--load.Rmd
4   °--data
5       ¦--testdat.csv
6       ¦--testdat2.csv
7       ¦--testdat.xlsx
8       ¦--testdat.txt
9       ¦--testdat.rda
10      ¦--english.rda
11      °--testcorpus
12          ¦--linguistics01.txt
13          ¦--linguistics02.txt
14          ¦--linguistics03.txt
15          ¦--linguistics04.txt
16          ¦--linguistics05.txt
17          ¦--linguistics06.txt
18          °--linguistics07.txt
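Before loading anything, it can help to confirm that the expected folders exist. The sketch below builds the same folder layout inside a temporary directory and then checks for it with base R. The folder name myproject_demo is a hypothetical stand-in for your own project folder; in a real project you would check here::here("data") rather than the temporary path.

```r
# hypothetical project root inside a temporary directory (assumption for illustration)
root <- file.path(tempdir(), "myproject_demo")

# create data/testcorpus, including all parent folders
dir.create(file.path(root, "data", "testcorpus"),
           recursive = TRUE, showWarnings = FALSE)

# check that both folders exist
has_data <- dir.exists(file.path(root, "data"))
has_corpus <- dir.exists(file.path(root, "data", "testcorpus"))

# inspect
c(data = has_data, testcorpus = has_corpus)
```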
The data used in this tutorial can be downloaded using the links below:
In R, there are multiple functions available for reading comma-separated (csv) files, Excel files, and other tabular formats. After covering these functions, we will briefly explore how to generate data from scratch, without loading pre-existing data files.
A common format for tabulated data is the comma-separated values (csv) file. To load such files, we can use the read.csv function as shown below.
# load csv file
datcsv <- read.csv(here::here("data", "testdat.csv"),
# indicate the data has column names
header=TRUE)
# inspect first 6 rows using the head() function
head(datcsv)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
The data is not spectacular and consists of a table with 2 columns (Variable1 and Variable2).
Sometimes, csv files are actually not comma-separated but use a semicolon as a separator. In such cases, we can use the read.delim function to load the csv and specify that the separator (sep) is ";".
# load csv with ; as the separator
datcsv2 <- read.delim(here::here("data", "testdat2.csv"),
# define separator
sep = ";",
# indicate that the data has column names
header = TRUE)
# inspect data
head(datcsv2)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
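As an alternative, base R also offers read.csv2, which assumes a semicolon as the separator and a comma as the decimal mark. The sketch below writes a small semicolon-separated file to a temporary location (the file and the object names are made up for illustration) and reads it back in.

```r
# write a small semicolon-separated file to a temporary location
tmp_csv2 <- tempfile(fileext = ".csv")
writeLines(c("Variable1;Variable2", "6;67", "65;16"), tmp_csv2)

# read.csv2 uses sep = ";" and dec = "," by default
demo_csv2 <- read.csv2(tmp_csv2)

# inspect
str(demo_csv2)
```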
To save a data set as a csv on your computer, we can use the write.csv function (here, the file is saved in the data folder within the folder where the Rproj is located).
# save data as a csv without row names
write.csv(datcsv, here::here("data", "testdat.csv"), row.names = FALSE)
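To see what the row.names argument does, the following sketch saves a small, made-up data frame twice in a temporary location, once with the default settings and once with row.names = FALSE, and then compares the column names after reading both files back in.

```r
# small example data frame (made up for illustration)
demo_df <- data.frame(Variable1 = c(6, 65), Variable2 = c(67, 16))

# temporary files so that nothing in your project is overwritten
tmp_with <- tempfile(fileext = ".csv")
tmp_without <- tempfile(fileext = ".csv")

# default: row names are written as an additional first column
write.csv(demo_df, tmp_with)
# row names are suppressed
write.csv(demo_df, tmp_without, row.names = FALSE)

# compare the column names after re-loading
names(read.csv(tmp_with))
names(read.csv(tmp_without))
```

The file saved with the default settings contains an extra first column (named X when re-loaded) that holds the row numbers.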
To load Excel data, you can use the read.xlsx function from the openxlsx package. We have activated the openxlsx package in the session preparation so we do not need to activate it again here. If you get an error message telling you that R did not find the read.xlsx function, you need to activate the openxlsx package by running library(openxlsx). Because the xlsx package, which is also loaded in this tutorial, provides a function with the same name, the code below calls openxlsx::read.xlsx explicitly.
# load data
datxlsx <- openxlsx::read.xlsx(
# define path where data is stored
here::here("data", "testdat.xlsx"),
# define spreadsheet to load
sheet = 1)
# inspect first 6 rows using the head() function
head(datxlsx)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
To save xlsx files, we can use the write.xlsx function from the openxlsx package as shown below.
openxlsx::write.xlsx(
# define object to be stored
datxlsx,
# define path where data should be stored
here::here("data", "testdat.xlsx"))
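Excel files often contain more than one spreadsheet. The sketch below assumes that the openxlsx package is installed and uses a temporary file plus made-up sheet names (first and second). It writes a named list of data frames to separate sheets, lists the sheet names, and then loads one sheet by name.

```r
library(openxlsx)

# temporary file so that nothing in your project is overwritten
tmp_xlsx <- tempfile(fileext = ".xlsx")

# a named list: each element becomes one spreadsheet
sheets <- list(first = data.frame(Variable1 = c(6, 65)),
               second = data.frame(Variable2 = c(67, 16)))
write.xlsx(sheets, tmp_xlsx)

# list the sheets in the file
sheet_names <- getSheetNames(tmp_xlsx)
sheet_names

# load a sheet by name instead of by position
second <- read.xlsx(tmp_xlsx, sheet = "second")
second
```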
If the data is tabular and stored as a txt file, there are various functions to read in the data. The most common functions are read.delim and read.table. The read.delim function is very flexible and allows you to specify the separator and to inform R whether the first row contains column headers rather than data points. Note that header = TRUE is the default for read.delim, so if the data does not contain column headers, you need to specify header = FALSE.
# load tab txt 1
dattxt <- read.delim(here::here("data", "testdat.txt"),
sep = "\t", header = TRUE)
# inspect data
head(dattxt)
The read.table function is very similar and can also be used to load various types of tabulated data. Again, we let R know that the first row contains column headers rather than data points. This matters here because read.table, unlike read.delim, does not assume a header row by default.
# load tab txt
dattxt2 <- read.table(here::here("data", "testdat.txt"), header = TRUE)
# inspect
head(dattxt2)
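The difference in defaults between the two functions is easy to overlook. The following sketch writes a small tab-separated file with a header row to a temporary location (the file content is made up for illustration) and reads it with both functions without setting header.

```r
# write a small tab-separated file with a header row
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("Variable1\tVariable2", "6\t67", "65\t16"), tmp_txt)

# read.delim: header = TRUE and sep = "\t" by default
with_delim <- read.delim(tmp_txt)
# read.table: here the header row is treated as data
with_table <- read.table(tmp_txt)

# compare column names and number of rows
names(with_delim)
names(with_table)
nrow(with_table)
```

With read.table, the column names become V1 and V2, and the header ends up as the first of three data rows.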
To save tabulated txt files, we use the write.table function. In the write.table function, we define the separator (in this case we write a tab-separated file) and tell R not to add row names (i.e., R should not number the rows and store this information in a separate column).
# save txt
write.table(dattxt, here::here("data", "testdat.txt"), sep = "\t", row.names = FALSE)
Unstructured data (most commonly data representing raw text) is also very common, particularly when working with corpus data.
To load text data into R (here in the form of a txt file), we can use the scan function. Reading in texts using the scan function will result in a vector of strings where each string represents a separate word.
testtxt <- scan(here::here("data", "english.txt"), what = "char")
# inspect
testtxt
## [1] "Linguistics" "is" "the" "scientific" "study"
## [6] "of" "language" "and" "it" "involves"
## [11] "the" "analysis" "of" "language" "form,"
## [16] "language" "meaning," "and" "language" "in"
## [21] "context."
In contrast, the readLines function will read in complete lines and result in a vector of strings representing lines (if the entire text is in 1 line, then the entire text will be loaded as a single string).
testtxt2 <- readLines(here::here("data", "english.txt"))
# inspect
testtxt2
## [1] "Linguistics is the scientific study of language and it involves the analysis of language form, language meaning, and language in context. "
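To make the difference between the two functions concrete, the sketch below writes a two-line text to a temporary file (the file is made up for illustration) and compares the length of the vectors returned by scan and readLines.

```r
# write a short two-line text to a temporary file
tmp_text <- tempfile(fileext = ".txt")
writeLines(c("Linguistics is the scientific study of language.",
             "It involves analysing language form."),
           tmp_text)

# scan splits on white space: one element per word
words <- scan(tmp_text, what = "character", quiet = TRUE)
# readLines keeps lines intact: one element per line
lines <- readLines(tmp_text)

# compare
length(words)
length(lines)
```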
To save text data, we can use the writeLines function as shown below.
writeLines(testtxt2, here::here("data", "english.txt"))
When dealing with text data, it is quite common to encounter scenarios where we need to load multiple files containing texts. In such cases, we typically begin by storing the file locations in an object (called fls here) and then load the files using the sapply function, which loops over the file paths. Within the sapply function, we can use either scan or readLines for reading the text. In the example below, we use scan and subsequently merge the individual elements into a single text using the paste0 function. The output shows that the 7 txt files from the testcorpus folder located within the data folder were loaded successfully.
# extract file paths
fls <- list.files(here::here("data", "testcorpus"), pattern = "\\.txt$", full.names = TRUE)
# load files
txts <- sapply(fls, function(x){
x <- scan(x, what = "char") %>%
paste0( collapse = " ")
})
# inspect
str(txts)
## Named chr [1:7] "Linguistics is the scientific study of language. It involves analysing language form language meaning and langu"| __truncated__ ...
## - attr(*, "names")= chr [1:7] "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics01.txt" "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics02.txt" "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics03.txt" "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics04.txt" ...
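The same procedure also works with readLines. The sketch below is one possible variant under stated assumptions: it creates a small, made-up corpus in a temporary folder (democorpus is a hypothetical name), loads each file as a single string, and uses basename to name the texts by file name rather than by full path.

```r
# hypothetical corpus folder inside a temporary directory
corpus_dir <- file.path(tempdir(), "democorpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("First demo text.", file.path(corpus_dir, "demo01.txt"))
writeLines(c("Second demo text.", "It has two lines."),
           file.path(corpus_dir, "demo02.txt"))

# extract file paths (only files ending in .txt)
demo_fls <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# load each file and collapse its lines into a single string
demo_txts <- sapply(demo_fls, function(f) paste(readLines(f), collapse = " "))

# use the file names (without the folder path) as names
names(demo_txts) <- basename(demo_fls)

# inspect
demo_txts
```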
To save multiple txt files, we follow a similar procedure and first determine the paths that define where R will store the files and then loop over the files and store them in the testcorpus folder.
# define where to save each file
outs <- file.path(here::here("data", "testcorpus"), paste0("text", 1:7, ".txt"))
# save the files
lapply(seq_along(txts), function(i)
writeLines(txts[[i]],
con = outs[i]))
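When working with R in RStudio, it makes sense to save data as R data objects as this requires minimal storage space and allows you to load and save data very quickly. R data objects can have any format (structured, unstructured, lists, etc.). Here, we use the readRDS function to load R data objects (which can represent any form or type of data). Note that files written with saveRDS are conventionally given the extension .rds; the file used here has the extension .rda but was saved with saveRDS, so it is read with readRDS.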
# load data
rdadat <- readRDS(here::here("data", "testdat.rda"))
# inspect
head(rdadat)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
To save R data objects, we use the saveRDS function as shown below.
saveRDS(rdadat, file = here::here("data", "testdat.rda"))
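Base R offers a second pair of functions, save and load, that is easily confused with saveRDS and readRDS. The sketch below contrasts the two using a made-up data frame and temporary files: readRDS returns the object so that you can assign it to any name, whereas load restores the object under the name it had when it was saved.

```r
# small example object (made up for illustration)
demo_obj <- data.frame(Variable1 = c(6, 65), Variable2 = c(67, 16))

# saveRDS / readRDS: one object per file, name chosen when loading
tmp_rds <- tempfile(fileext = ".rds")
saveRDS(demo_obj, tmp_rds)
restored <- readRDS(tmp_rds)

# save / load: the object name is stored in the file
tmp_rda <- tempfile(fileext = ".rda")
save(demo_obj, file = tmp_rda)
rm(demo_obj)
# load() recreates demo_obj and returns the names of the restored objects
loaded_names <- load(tmp_rda)

# both routes give back the same data
identical(restored, demo_obj)
```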
You can load all types of data discussed above from the web. The only thing you need to do is to change the path. Instead of defining a path on your own computer, simply replace it with a URL wrapped in the url function, with the additional argument "rb". For text-based formats such as csv, functions like read.csv also accept the URL directly as a string.
So loading the testdat.rda from the LADAL GitHub data repo would require the following path specification:
url("https://slcladal.github.io/data/testdat.rda", "rb")
compared to the data repo in the current Rproj:
here::here("data", "testdat.rda")
See below how you can load, e.g., an rda object from the LADAL data repo on GitHub.
webdat <- base::readRDS(url("https://slcladal.github.io/data/testdat.rda", "rb"))
# inspect
head(webdat)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
We can then store this data as shown in the sections above.
In this section, we will briefly have a look at how to generate data in R.
To create a simple data frame, we can simply generate the columns and then bind them together using the data.frame function as shown below.
# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
l1 <- c("english", "german", "english")
mydat <- data.frame(age, gender, l1)
# inspect
head(mydat)
## age gender l1
## 1 25 male english
## 2 30 female german
## 3 56 male english
You can also generate more complex data sets where columns or variables correlate with each other. Below, we will generate a data set with 4 correlated variables: Proficiency (the proficiency of a speaker), Abroad (whether or not subjects have been abroad), University (if they went to a standard or excellent university), and PluralError (if they produced a number marking error in a test sentence).
We start by setting a seed so that the generated data will be the same each time we run the code.
set.seed(678)
Next, we create a correlation matrix. Here, we will create 4 variables, and for each of these variables we have to determine how strongly it should be correlated with each of the other variables. The diagonal values are 1 as each variable correlates perfectly with itself.
cmat <- c(1.00, 0.05, 0.05, -0.5,
0.05, 1.00, 0.05, -0.3,
0.05, 0.05, 1.00, -0.1,
-0.50, -0.30, -0.10, 1.0)
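Strictly speaking, the object above is a vector of 16 values that represents a 4 by 4 matrix. As a sanity check, the sketch below (with object names chosen for illustration) arranges the same values as a matrix and confirms that it is a valid correlation matrix: symmetric, with 1s on the diagonal, and positive definite (all eigenvalues above 0).

```r
# the same values as above, stored under a separate name
cmat_demo <- c(1.00, 0.05, 0.05, -0.5,
               0.05, 1.00, 0.05, -0.3,
               0.05, 0.05, 1.00, -0.1,
               -0.50, -0.30, -0.10, 1.0)

# arrange the values as a 4 x 4 matrix
cmat_mat <- matrix(cmat_demo, nrow = 4, byrow = TRUE)

# checks: symmetric, unit diagonal, positive definite
is_sym <- isSymmetric(cmat_mat)
unit_diag <- all(diag(cmat_mat) == 1)
pos_def <- all(eigen(cmat_mat, symmetric = TRUE)$values > 0)

# inspect
c(symmetric = is_sym, unit_diagonal = unit_diag, positive_definite = pos_def)
```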
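Next, we generate the data using the rnorm_multi function from the faux package. In this function, we need to specify the number of observations (n), the number of variables (vars), the mean of each variable (mu), the standard deviation of each variable (sd), the correlations between the variables (here the values stored in cmat), and the names of the variables (varnames). If all variables should have the same mean, we only need to provide a single value, but we need to provide 4 values if we want the variables to have different means.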
dat <- faux::rnorm_multi(n = 400, vars = 4, mu = 1, sd = 1, r = cmat,
varnames = c("Proficiency", "Abroad", "University", "PluralError"))
# inspect
head(dat)
## Proficiency Abroad University PluralError
## 1 1.2214098 1.2103764 1.1571743 -0.09080666
## 2 0.3439251 0.4244035 2.2630978 2.19437567
## 3 0.9070220 0.7618138 0.9030697 1.62879140
## 4 2.3410885 -0.5503251 2.0893112 -0.45122521
## 5 3.0568993 1.4600793 0.7689547 -1.05360914
## 6 2.7434079 1.5160429 1.2232027 1.51467370
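To illustrate the idea behind generating correlated variables, here is a rough sketch in base R that does not use faux. It draws independent standard normal values and multiplies them by the Cholesky factor of the correlation matrix. This is a different technique from whatever rnorm_multi does internally, the object names are our own, and the generated values will not match the output above; only the correlations should be similar.

```r
set.seed(678)

# correlation matrix with the same values as above
r_mat <- matrix(c(1.00, 0.05, 0.05, -0.5,
                  0.05, 1.00, 0.05, -0.3,
                  0.05, 0.05, 1.00, -0.1,
                  -0.50, -0.30, -0.10, 1.0), nrow = 4)

# 400 observations of 4 independent standard normal variables
z <- matrix(rnorm(400 * 4), ncol = 4)

# impose the correlations via the Cholesky factor, then shift the mean to 1
sim <- z %*% chol(r_mat) + 1
colnames(sim) <- c("Proficiency", "Abroad", "University", "PluralError")

# inspect the realized correlations
round(cor(sim), 2)
```

The realized correlations vary somewhat around the target values because the data is a random sample.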
If you want to generate numeric data, this is all you need to do. If you want to generate categorical variables, however, you need to convert these numeric values into categories. In the example below, we convert all values higher than 1 (the mean) into one level and all other values into a second level.
# modify data
dat <- dat %>%
dplyr::mutate(Proficiency = ifelse(Proficiency > 1, "Advanced", "Intermediate"),
Abroad = ifelse(Abroad > 1, "Abroad", "Home"),
University = ifelse(University > 1, "Excellent", "Standard"),
PluralError = ifelse(PluralError > 1, "Error", "Correct"))
# inspect
head(dat)
## Proficiency Abroad University PluralError
## 1 Advanced Abroad Excellent Correct
## 2 Intermediate Home Excellent Error
## 3 Intermediate Home Standard Error
## 4 Advanced Home Excellent Correct
## 5 Advanced Abroad Standard Correct
## 6 Advanced Abroad Excellent Error
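Note that ifelse returns character strings rather than factors. If you need proper factors, for instance to control the order of the levels, you can wrap the result in the factor function. The sketch below uses a short made-up vector rather than the generated data.

```r
# a short character vector (made up for illustration)
prof <- c("Advanced", "Intermediate", "Intermediate", "Advanced")

# convert to a factor and define the order of the levels explicitly
prof_fct <- factor(prof, levels = c("Intermediate", "Advanced"))

# inspect levels and frequencies
levels(prof_fct)
table(prof_fct)
```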
And again, we could then save this data on our computer as shown in the sections above. For instance, we could save it as an MS Excel file as shown below.
write.xlsx(dat, here::here("data", "dat.xlsx"))
You may also want to create textual data (e.g., to create sample sentences or short test texts). Thus, we will briefly focus on how to create textual data in R.
The easiest way to generate text data is to simply create strings and combine them as shown below.
text <- c("This is an example sentence.", "This is a second example sentence")
# inspect
text
## [1] "This is an example sentence." "This is a second example sentence"
If you need to generate many sentences that have a standard format, you can make use of the paste function.
num <- 1:4
start <- "This is sentence number "
end <- "."
texts <- paste(start, num, end, sep = "")
# inspect
texts
## [1] "This is sentence number 1." "This is sentence number 2."
## [3] "This is sentence number 3." "This is sentence number 4."
Or, you can combine these text snippets into a single string.
onetext <- paste(start, num, end, sep = "", collapse = " ")
# inspect
onetext
## [1] "This is sentence number 1. This is sentence number 2. This is sentence number 3. This is sentence number 4."
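An alternative to paste for sentences with a fixed format is the base R function sprintf, which fills placeholders in a template string. The sketch below (object names chosen for illustration) produces the same four sentences and checks that both approaches give identical results.

```r
num_demo <- 1:4

# %d is a placeholder for an integer
texts_sprintf <- sprintf("This is sentence number %d.", num_demo)

# the same sentences created with paste
texts_paste <- paste("This is sentence number ", num_demo, ".", sep = "")

# both approaches give the same result
identical(texts_sprintf, texts_paste)
```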
The text can then be saved using the writeLines function as shown below.
writeLines(onetext, here::here("data", "onetext.txt"))
This is all for this tutorial. We hope it is useful and that you have a better idea about how to load, save and generate data now.
Schweinberger, Martin. 2022. Loading, saving, and generating data in R. Brisbane: The University of Queensland. URL: https://ladal.edu.au/load.html (Version 2022.11.08).
@manual{schweinberger2022loadr,
author = {Schweinberger, Martin},
title = {Loading, saving, and generating data in R},
note = {https://ladal.edu.au/load.html},
year = {2022},
organization = {The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2022.11.08}
}
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.tree_1.0.0 here_1.0.1 openxlsx_4.2.5.2 xlsx_0.6.5
## [5] flextable_0.9.1 tidyr_1.3.0 stringr_1.5.0 dplyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 xlsxjars_0.6.1 assertthat_0.2.1
## [4] rprojroot_2.0.3 digest_0.6.31 utf8_1.2.3
## [7] mime_0.12 R6_2.5.1 evaluate_0.21
## [10] ggplot2_3.4.2 highr_0.10 pillar_1.9.0
## [13] gdtools_0.3.3 rlang_1.1.1 curl_5.0.0
## [16] uuid_1.1-0 rstudioapi_0.14 data.table_1.14.8
## [19] jquerylib_0.1.4 klippy_0.0.0.9500 rmarkdown_2.21
## [22] textshaping_0.3.6 munsell_0.5.0 shiny_1.7.4
## [25] compiler_4.2.2 httpuv_1.6.11 xfun_0.39
## [28] pkgconfig_2.0.3 askpass_1.1 systemfonts_1.0.4
## [31] gfonts_0.2.0 htmltools_0.5.5 faux_1.2.1
## [34] openssl_2.0.6 tidyselect_1.2.0 tibble_3.2.1
## [37] fontBitstreamVera_0.1.1 httpcode_0.3.0 fansi_1.0.4
## [40] crayon_1.5.2 later_1.3.1 crul_1.4.0
## [43] grid_4.2.2 gtable_0.3.3 jsonlite_1.8.4
## [46] xtable_1.8-4 lifecycle_1.0.3 magrittr_2.0.3
## [49] scales_1.2.1 zip_2.3.0 cli_3.6.1
## [52] stringi_1.7.12 cachem_1.0.8 promises_1.2.0.1
## [55] xml2_1.3.4 bslib_0.4.2 ellipsis_0.3.2
## [58] ragg_1.2.5 generics_0.1.3 vctrs_0.6.2
## [61] tools_4.2.2 glue_1.6.2 officer_0.6.2
## [64] fontquiver_0.2.1 purrr_1.0.1 fastmap_1.1.1
## [67] yaml_2.3.7 colorspace_2.1-0 fontLiberation_0.1.0
## [70] rJava_1.0-6 knitr_1.43 sass_0.4.6
If you want to render the R Notebook on your machine, i.e. knit the document to html or pdf, you need to make sure that you have R and RStudio installed. You also need to download the bibliography file and store it in the same folder as the Rmd file.