The data we work with comes in many formats and types. This tutorial therefore shows how to load and save different types of data in R, and takes a brief look at how to generate data in R.
This tutorial is aimed at beginners. The goal is not to provide a fully-fledged analysis but to show, with examples, how to load and save the most common types of data and data structures in R.
The entire R Notebook for the tutorial can be downloaded here.
If you want to render the R Notebook on your machine, i.e. knitting the
document to html or a pdf, you need to make sure that you have R and
RStudio installed and you also need to download the bibliography
file and store it in the same folder where you store the
Rmd file.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages listed below, you can skip this section. Otherwise, run the following code to install them. This may take between 1 and 5 minutes, so there is no need to worry if it takes some time.
# install packages
install.packages("xlsx")
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("flextable")
install.packages("openxlsx")
install.packages("here")
install.packages("faux")
install.packages("data.tree")
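# install klippy for copy-to-clipboard button in code chunks
# (install_github comes from the remotes package, which must be installed first)
install.packages("remotes")
remotes::install_github("rlesur/klippy")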
Now that we have installed the packages, we can activate them as shown below.
# load packages
library(dplyr)
library(stringr)
library(tidyr)
library(flextable)
library(xlsx)
library(openxlsx)
library(here)
library(data.tree)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.
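If you are unsure which of the packages are already installed, a small check can save installation time. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and not part of the original tutorial.

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  # installed.packages() returns a matrix whose row names are package names
  setdiff(pkgs, rownames(installed.packages()))
}

# Packages used in this tutorial
pkgs <- c("xlsx", "dplyr", "stringr", "tidyr", "flextable",
          "openxlsx", "here", "faux", "data.tree")
to_install <- missing_packages(pkgs)
to_install

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```

The install step is left commented out so that running the sketch only reports what is missing.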
NOTE
Because we usually load data from our own computers, this tutorial assumes that you do the same. Specifically, we assume that you have stored your data in a subfolder called data inside the directory that contains your R project (the Rproj file). We assume that the data sets we load are located in that data subfolder. We also show how you can load multiple text files into R (which is common if you work with corpora, for example). These text files are located in a folder called testcorpus within the data subfolder.
If you have a different set-up, you have to adapt the paths to the data for the tutorial to work on your own computer. Also note that the here function is used to create paths that start in the directory where the Rproj file is located.
In other words, your directory should have the structure as shown below.
                       levelName
1  myproject
2   ¦--Rproj
3   ¦--load.Rmd
4   °--data
5       ¦--testdat.csv
6       ¦--testdat2.csv
7       ¦--testdat.xlsx
8       ¦--testdat.txt
9       ¦--testdat.rda
10      ¦--english.rda
11      °--testcorpus
12          ¦--linguistics01.txt
13          ¦--linguistics02.txt
14          ¦--linguistics03.txt
15          ¦--linguistics04.txt
16          ¦--linguistics05.txt
17          ¦--linguistics06.txt
18          °--linguistics07.txt
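Before loading anything, it can help to confirm that the expected folders and files are in place. The sketch below is an illustration in base R under stated assumptions: project_dir is a hypothetical stand-in for the folder that contains your Rproj file, and a temporary folder is used so that the code runs on any machine.

```r
# Assumption: `project_dir` stands in for the folder that contains your Rproj file;
# a temporary folder is used here so the sketch runs anywhere
project_dir <- file.path(tempdir(), "myproject")
dir.create(file.path(project_dir, "data", "testcorpus"),
           recursive = TRUE, showWarnings = FALSE)

# Some of the files the tutorial expects inside the data subfolder
expected <- c("testdat.csv", "testdat2.csv", "testdat.xlsx", "testdat.txt")

# Check which of them exist; FALSE marks a file that still needs to be copied into data
found <- file.exists(file.path(project_dir, "data", expected))
names(found) <- expected
found
```

In a real project you would replace project_dir with the path to your own project folder.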
The data used in this tutorial can be downloaded using the links below:
Several functions allow us to read tabular data, such as comma-separated (csv) and Excel files, into R. After we go over these, we will have a brief look at how to create data from scratch, i.e. generating rather than loading data.
A common file type for tabular data is the comma-separated values (csv) file. To load such files, we can use the read.csv function as shown below.
datcsv <- read.csv(here::here("data", "testdat.csv"), header = TRUE)
# inspect
head(datcsv)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
The data is not spectacular and consists of a table with 2 columns (Variable1 and Variable2).
Sometimes, csv files are not actually comma-separated but use a semicolon as the separator. In such cases, we can use the read.delim function to load the csv and specify that the separator (sep) is “;”.
# load csv with ;
datcsv2 <- read.delim(here::here("data", "testdat2.csv"),
sep = ";", header = TRUE)
# inspect data
head(datcsv2)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
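As a self-contained illustration, the sketch below writes a small semicolon-separated file to a temporary location (so no file from the tutorial is needed) and reads it back with read.delim. It also shows what happens when the separator is not specified. Base R additionally offers read.csv2, which assumes a semicolon separator and a decimal comma; keep that in mind if your numbers use a decimal point.

```r
# Write a tiny semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("Variable1;Variable2", "6;67", "65;16"), tmp)

# Read it with the separator specified: two columns are recovered
semi <- read.delim(tmp, sep = ";", header = TRUE)
str(semi)

# Read it as if it were comma-separated: every row ends up in a single column
wrong <- read.csv(tmp)
ncol(wrong)
```

Checking the number of columns right after loading is a quick way to spot a wrong separator.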
To save a data set as a csv file on your computer, use the write.csv function (here, the file is saved in the data folder inside the directory where the Rproj file is located).
write.csv(datcsv, here::here("data", "testdat.csv"), row.names = FALSE)
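The effect of the row.names argument is easiest to see in a round trip. The following sketch uses a small made-up data frame and temporary files (both are assumptions for illustration) and compares the column names after re-reading.

```r
# Small example data frame (made up for illustration)
dat <- data.frame(Variable1 = c(6, 65), Variable2 = c(67, 16))

with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(dat, with_rn)                        # default: row names are written
write.csv(dat, without_rn, row.names = FALSE)  # row names are omitted

# Re-read both files and compare the column names
names(read.csv(with_rn))     # contains an extra first column holding the row names
names(read.csv(without_rn))  # only the original columns
```

Omitting the row names keeps the file identical in shape to the data frame you saved.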
To load Excel data, you can use the read.xlsx function from the openxlsx package. We have activated the openxlsx package in the session preparation, so we do not need to activate it again here. If you get an error message telling you that R did not find the read.xlsx function, you need to activate the openxlsx package by running library(openxlsx). Note that the xlsx package, which is also activated above, provides a function with the same name, so the code below calls the function with the openxlsx:: prefix.
# load data
datxlsx <- openxlsx::read.xlsx(here::here("data", "testdat.xlsx"), sheet = 1)
# inspect
head(datxlsx)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
To save xlsx files, we can use the write.xlsx function from the openxlsx package as shown below.
openxlsx::write.xlsx(datxlsx, here::here("data", "testdat.xlsx"))
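Excel workbooks often contain more than one sheet. The sketch below is a hedged illustration, assuming the openxlsx package is installed: it writes a named list of two made-up data frames to a temporary workbook, lists the sheet names, and reads one sheet back by name. The data and sheet names are hypothetical.

```r
# Runs only if openxlsx is available
if (requireNamespace("openxlsx", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".xlsx")

  # A named list is written as one sheet per list element
  sheets <- list(first = data.frame(Variable1 = c(6, 65)),
                 second = data.frame(Variable2 = c(67, 16)))
  openxlsx::write.xlsx(sheets, tmp)

  # Inspect the sheet names, then read a sheet by name instead of by position
  sheet_names <- openxlsx::getSheetNames(tmp)
  second <- openxlsx::read.xlsx(tmp, sheet = "second")
}
```

Reading by name rather than by index is more robust if the order of sheets in the workbook changes.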
If the data is tabular and stored as a txt file, there are various functions to read in the data. The most common functions are read.delim and read.table. The read.delim function is very flexible and allows you to specify the separator and to inform R whether the first row contains column headers rather than data points. For read.delim, header = TRUE is the default, so you need to set header = FALSE if the data does not contain column headers; for read.table, the default is header = FALSE.
# load tab txt 1
dattxt <- read.delim(here::here("data", "testdat.txt"),
sep = "\t", header = TRUE)
# inspect data
head(dattxt)
The read.table
function is very similar and can also be
used to load various types of tabulated data. Again, we let R know that
the first row contains column headers rather than data points.
# load tab txt
dattxt2 <- read.table(here::here("data", "testdat.txt"), header = TRUE)
# inspect
head(dattxt2)
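The two functions differ in how they treat the first row by default. The sketch below illustrates this with a small tab-separated file written to a temporary location (the file content is made up for illustration).

```r
# Write a small tab-separated file with a header row
tmp <- tempfile(fileext = ".txt")
writeLines(c("Variable1\tVariable2", "6\t67", "65\t16"), tmp)

# read.delim treats the first row as column headers by default
names(read.delim(tmp))

# read.table treats the first row as data here and assigns generic column names
names(read.table(tmp))

# With header = TRUE, read.table recovers the column names as well
names(read.table(tmp, header = TRUE))
```

If the first row of your result looks like column names, the header argument is the first thing to check.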
To save tabulated txt files, we use the write.table function. In the write.table function, we define the separator (in this case we write a tab-separated file) and tell R not to add row names (i.e., R should not number the rows and store this information in a separate column).
# save txt
write.table(dattxt, here::here("data", "testdat.txt"), sep = "\t", row.names = FALSE)
Unstructured data (most commonly data representing raw text) is also very common - particularly when working with corpus data.
To load text data into R (here in the form of a txt file), we can use the scan function. Reading in texts using the scan function will result in a vector of strings where each string represents a separate word.
testtxt <- scan(here::here("data", "english.txt"), what = "char")
# inspect
testtxt
## [1] "Linguistics" "is" "the" "scientific" "study"
## [6] "of" "language" "and" "it" "involves"
## [11] "the" "analysis" "of" "language" "form,"
## [16] "language" "meaning," "and" "language" "in"
## [21] "context."
In contrast, the readLines function will read in complete lines and result in a vector of strings representing lines (if the entire text is on one line, then the entire text will be loaded as a single string).
testtxt2 <- readLines(here::here("data", "english.txt"))
# inspect
testtxt2
## [1] "Linguistics is the scientific study of language and it involves the analysis of language form, language meaning, and language in context. "
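The difference between the two functions can be checked on a small temporary file. The sketch below is an illustration with a made-up sentence. One detail worth knowing: by default, scan treats apostrophes and quotation marks as quote characters when splitting on whitespace, which can distort running text; setting quote = "" switches this off.

```r
# Write one line of text to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines("It's the scientific study of language.", tmp)

# scan splits on whitespace and returns one element per word;
# quote = "" stops the apostrophe from being treated as a quote character
words <- scan(tmp, what = character(), quote = "", quiet = TRUE)

# readLines returns one element per line
txt_lines <- readLines(tmp)

length(words)      # number of words
length(txt_lines)  # number of lines
```

Which function is preferable depends on whether the next processing step expects words or lines.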
To save text data, we can use the writeLines function as shown below.
writeLines(testtxt2, here::here("data", "english.txt"))
When working with text data, it is very common that we have to load multiple (or many) files containing texts. In this case, we first store the locations of the files in an object (here called fls) and then load the files in these locations using sapply (a looping function). In the sapply function, we can use either scan or readLines to read in the text. Below we use scan and then combine the individual elements into a text using the paste0 function. The output shows that we have successfully loaded 7 txt files from the testcorpus folder inside the data folder.
# extract file paths (only files ending in .txt)
fls <- list.files(here::here("data", "testcorpus"), pattern = "\\.txt$", full.names = TRUE)
# load files: read each file word by word and paste the words into one string
txts <- sapply(fls, function(x){
  x <- scan(x, what = "char") %>%
    paste0(collapse = " ")
})
# inspect
str(txts)
## Named chr [1:7] "Linguistics is the scientific study of language. It involves analysing language form language meaning and langu"| __truncated__ ...
## - attr(*, "names")= chr [1:7] "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics01.txt" "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics02.txt" "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics03.txt" "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics04.txt" ...
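As the output shows, sapply names each text after its full file path, which can be unwieldy. The following sketch is a variant under stated assumptions: it creates three made-up files in a temporary folder (the folder and file names are hypothetical), reads them with readLines instead of scan, and shortens the names to the file names with basename.

```r
# Create a temporary folder with three small example files (made up for illustration)
corpus_dir <- file.path(tempdir(), "testcorpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
for (i in 1:3) {
  writeLines(c(paste("This is text", i), "It has two lines."),
             file.path(corpus_dir, sprintf("demo%02d.txt", i)))
}

# Collect the paths of all files ending in .txt
fls <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file line by line and collapse the lines into a single string
txts <- sapply(fls, function(f) paste(readLines(f), collapse = " "))

# Replace the full paths with the bare file names
names(txts) <- basename(fls)
txts
```

Short names make it easier to keep track of which text came from which file in later steps.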
To save multiple txt files, we follow a similar procedure and first determine the paths that define where R will store the files and then loop over the files and store them in the testcorpus folder.
# define where to save each file
outs <- file.path(here::here("data", "testcorpus"), paste0("text", 1:7, ".txt"))
# save the files (invisible() suppresses the list of NULLs returned by lapply)
invisible(lapply(seq_along(txts), function(i)
  writeLines(txts[[i]], con = outs[i])))
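When working with R in RStudio, it makes sense to save data as R data objects, as this requires minimal storage space and allows you to load and save data very quickly. R data objects can have any format (structured, unstructured, lists, etc.). Here, we use the readRDS function to load R data objects (which can represent any form or type of data). Files written with saveRDS are conventionally given the extension rds; the example files here use rda, which also works because readRDS does not depend on the file extension.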
# load data
rdadat <- readRDS(here::here("data", "testdat.rda"))
# inspect
head(rdadat)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
To save R data objects, we use the saveRDS
function as
shown below.
saveRDS(rdadat, file = here::here("data", "testdat.rda"))
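One advantage of R data objects is that they preserve the object exactly, including data types such as factors and nested structures such as lists. The sketch below illustrates this with a made-up object and a temporary file.

```r
# A made-up object: a list holding a data frame (with a factor column) and a string
obj <- list(scores = data.frame(id = 1:3, group = factor(c("a", "b", "a"))),
            note = "any R object can be stored")

# Save the object to a temporary file and load it again
tmp <- tempfile(fileext = ".rds")
saveRDS(obj, file = tmp)
restored <- readRDS(tmp)

# Check that the restored object is identical to the original
identical(obj, restored)
```

A csv round trip, by comparison, would not keep the list structure and would return the factor column as plain text.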
You can load all types of data discussed above from the web. The only thing you need to do is to change the path. Instead of defining a path on your own computer, simply replace it with a URL wrapped in the url function, together with the additional argument "rb".
So loading testdat.rda from the LADAL GitHub data repo would require the following path specification:
url("https://slcladal.github.io/data/testdat.rda", "rb")
compared to the data repo in the current Rproj:
here::here("data", "testdat.rda")
See below how you can load, e.g., an rda object from the LADAL data repo on GitHub.
webdat <- base::readRDS(url("https://slcladal.github.io/data/testdat.rda", "rb"))
# inspect
head(webdat)
## Variable1 Variable2
## 1 6 67
## 2 65 16
## 3 12 56
## 4 56 34
## 5 45 54
## 6 84 42
We can then store this data as shown in the sections above.
In this section, we will briefly have a look at how to generate data in R.
To create a simple data frame, we can simply generate the columns and
then bind them together using the data.frame
function as
shown below.
# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
l1 <- c("english", "german", "english")
mydat <- data.frame(age, gender, l1)
# inspect
head(mydat)
## age gender l1
## 1 25 male english
## 2 30 female german
## 3 56 male english
You can also generate more complex data sets where columns or variables correlate with each other. Below, we will generate a data set with 4 correlated variables: Proficiency (the proficiency of a speaker), Abroad (whether or not subjects have been abroad), University (if they went to a standard or excellent university), and PluralError (if they produced a number marking error in a test sentence).
We start by setting a seed so that the generated data will be the same each time we run the code.
set.seed(678)
Next, we create a correlation matrix. Here, we will create 4 variables, and we have to determine how strongly each variable should be correlated with each of the other variables. The diagonal values are 1 as each variable correlates perfectly with itself.
cmat <- c(1.00, 0.05, 0.05, -0.5,
0.05, 1.00, 0.05, -0.3,
0.05, 0.05, 1.00, -0.1,
-0.50, -0.30, -0.10, 1.0)
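The values above are stored as a plain vector. Before generating data, it can be worth checking that they form a valid correlation matrix, i.e. one that is symmetric, has 1s on the diagonal, and is positive definite. The following is a minimal sketch in base R; the object name cmat_m is an assumption introduced for illustration.

```r
cmat <- c(1.00, 0.05, 0.05, -0.5,
          0.05, 1.00, 0.05, -0.3,
          0.05, 0.05, 1.00, -0.1,
          -0.50, -0.30, -0.10, 1.0)

# Arrange the 16 values into a 4 x 4 matrix, filling row by row
cmat_m <- matrix(cmat, nrow = 4, byrow = TRUE)

# A valid correlation matrix is symmetric ...
isSymmetric(cmat_m)
# ... has 1s on the diagonal ...
all(diag(cmat_m) == 1)
# ... and has only positive eigenvalues (positive definite)
all(eigen(cmat_m, symmetric = TRUE)$values > 0)
```

If any of these checks returned FALSE, the requested combination of correlations could not be realised by any data set.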
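Next, we generate the data using the rnorm_multi function. In this function, we need to specify the number of observations (n), the number of variables (vars), the means of the variables (mu), their standard deviations (sd), the correlations between the variables (here, the cmat object), and the names of the variables (varnames). If all variables should have the same mean, we only need to provide a single value, but we need to provide 4 values if we want the variables to have different means.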
dat <- faux::rnorm_multi(n = 400, vars = 4, mu = 1, sd = 1, r = cmat,
                         varnames = c("Proficiency", "Abroad", "University", "PluralError"))
# inspect
head(dat)
## Proficiency Abroad University PluralError
## 1 1.2214098 1.2103764 1.1571743 -0.09080666
## 2 0.3439251 0.4244035 2.2630978 2.19437567
## 3 0.9070220 0.7618138 0.9030697 1.62879140
## 4 2.3410885 -0.5503251 2.0893112 -0.45122521
## 5 3.0568993 1.4600793 0.7689547 -1.05360914
## 6 2.7434079 1.5160429 1.2232027 1.51467370
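To get a feeling for what generating correlated data involves, the sketch below builds comparable data in base R using a Cholesky decomposition. This is an illustration of one standard technique, not a description of how rnorm_multi works internally, and the values it produces will differ from those shown above. The object names cmat_m, z, and x are assumptions introduced for illustration.

```r
set.seed(678)

# Target correlation matrix (same values as in the tutorial)
cmat_m <- matrix(c(1.00, 0.05, 0.05, -0.5,
                   0.05, 1.00, 0.05, -0.3,
                   0.05, 0.05, 1.00, -0.1,
                   -0.50, -0.30, -0.10, 1.0),
                 nrow = 4, byrow = TRUE)

n <- 400
# Step 1: draw uncorrelated standard normal values (n rows, 4 columns)
z <- matrix(rnorm(n * 4), nrow = n)
# Step 2: multiply by the Cholesky factor to impose the correlations,
#         then add 1 to shift the mean of every variable to 1
x <- z %*% chol(cmat_m) + 1
colnames(x) <- c("Proficiency", "Abroad", "University", "PluralError")

# Step 3: compare the observed correlations with the target values
round(cor(x), 2)
```

The observed correlations will be close to, but not exactly equal to, the target values because the data are a random sample.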
If you want to generate numeric data, this is all you need to do. If you want to generate categorical variables, however, you need to convert these numeric values into categories. In the example below, we convert all values higher than 1 (the mean) into one level and all other values into a second level.
# modify data
dat <- dat %>%
dplyr::mutate(Proficiency = ifelse(Proficiency > 1, "Advanced", "Intermediate"),
Abroad = ifelse(Abroad > 1, "Abroad", "Home"),
University = ifelse(University > 1, "Excellent", "Standard"),
PluralError = ifelse(PluralError > 1, "Error", "Correct"))
# inspect
head(dat)
## Proficiency Abroad University PluralError
## 1 Advanced Abroad Excellent Correct
## 2 Intermediate Home Excellent Error
## 3 Intermediate Home Standard Error
## 4 Advanced Home Excellent Correct
## 5 Advanced Abroad Standard Correct
## 6 Advanced Abroad Excellent Error
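Note that ifelse returns character strings. If you need proper factors, for example to control the order of the levels, you can wrap the result in factor. The sketch below illustrates this with a small made-up numeric vector; the object names are hypothetical.

```r
set.seed(678)
# Made-up numeric scores with mean 1 and standard deviation 1
proficiency_num <- rnorm(10, mean = 1, sd = 1)

# Dichotomise at the mean and store the result as a factor with an explicit level order
proficiency <- factor(ifelse(proficiency_num > 1, "Advanced", "Intermediate"),
                      levels = c("Intermediate", "Advanced"))

# Inspect the levels and how often each one occurs
levels(proficiency)
table(proficiency)
```

Setting the level order explicitly determines which level is treated as the reference level in later analyses.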
And again, we could then save this data on our computer as shown in the sections above. For instance, we could save it as an MS Excel file as shown below.
openxlsx::write.xlsx(dat, here::here("data", "dat.xlsx"))
You may also want to create textual data (e.g., to create sample sentences or short test texts). Thus, we will briefly focus on how to create textual data in R.
The easiest way to generate text data is to simply create strings and combine them as shown below.
text <- c("This is an example sentence.", "This is a second example sentence")
# inspect
text
## [1] "This is an example sentence." "This is a second example sentence"
If you need to generate many sentences that have a standard format, you can make use of the paste function.
num <- 1:4
start <- "This is sentence number "
end <- "."
texts <- paste(start, num, end, sep = "")
# inspect
texts
## [1] "This is sentence number 1." "This is sentence number 2."
## [3] "This is sentence number 3." "This is sentence number 4."
Or, you can combine these text snippets into a single string.
onetext <- paste(start, num, end, sep = "", collapse = " ")
# inspect
onetext
## [1] "This is sentence number 1. This is sentence number 2. This is sentence number 3. This is sentence number 4."
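An alternative to paste for sentences with a fixed pattern is sprintf, which fills values into a template string. The sketch below is a small illustration showing that both approaches produce the same sentences here; the object names are assumptions.

```r
num <- 1:4

# sprintf inserts each number at the position of the %d placeholder
texts_sprintf <- sprintf("This is sentence number %d.", num)

# The same sentences built with paste
texts_paste <- paste("This is sentence number ", num, ".", sep = "")

# Both approaches give the same result
identical(texts_sprintf, texts_paste)
```

Templates tend to stay readable when a sentence contains several slots to fill.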
The text can then be saved using the writeLines
function
as shown below.
writeLines(onetext, here::here("data", "onetext.txt"))
This is all for this tutorial. We hope it is useful and that you have a better idea about how to load, save and generate data now.
Schweinberger, Martin. 2022. Loading, saving, and generating data in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/load.html (Version 2022.11.08).
@manual{schweinberger2022loadr,
author = {Schweinberger, Martin},
title = {Loading, saving, and generating data in R},
note = {https://ladal.edu.au/load.html},
year = {2022},
organization = {The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2022.11.08}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.tree_1.0.0 here_1.0.1 openxlsx_4.2.5 xlsx_0.6.5
## [5] flextable_0.8.2 tidyr_1.2.0 stringr_1.4.1 dplyr_1.0.10
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.32 bslib_0.4.0 purrr_0.3.4
## [5] rJava_1.0-6 colorspace_2.0-3 vctrs_0.4.1 generics_0.1.3
## [9] htmltools_0.5.3 yaml_2.3.5 base64enc_0.1-3 utf8_1.2.2
## [13] rlang_1.0.4 jquerylib_0.1.4 pillar_1.8.1 glue_1.6.2
## [17] DBI_1.1.3 gdtools_0.2.4 faux_1.1.0 uuid_1.1-0
## [21] lifecycle_1.0.1 munsell_0.5.0 gtable_0.3.0 zip_2.2.0
## [25] evaluate_0.16 knitr_1.40 fastmap_1.1.0 fansi_1.0.3
## [29] xlsxjars_0.6.1 highr_0.9 Rcpp_1.0.9 scales_1.2.1
## [33] cachem_1.0.6 jsonlite_1.8.0 systemfonts_1.0.4 ggplot2_3.3.6
## [37] digest_0.6.29 stringi_1.7.8 grid_4.2.1 rprojroot_2.0.3
## [41] cli_3.3.0 tools_4.2.1 magrittr_2.0.3 sass_0.4.2
## [45] klippy_0.0.0.9500 tibble_3.1.8 pkgconfig_2.0.3 data.table_1.14.2
## [49] xml2_1.3.3 assertthat_0.2.1 rmarkdown_2.16 officer_0.4.4
## [53] rstudioapi_0.14 R6_2.5.1 compiler_4.2.1