Introduction

The data we work with comes in many formats and types. Therefore, this tutorial shows how you can load and save different types of data when working with R. We will also have a brief look at how to generate data in R.

This tutorial is aimed at beginners. It showcases how to load and save different types of data and data structures in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify how to load and save the most common types of data in R.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.


Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages listed below, you can skip this step. Otherwise, install them by running the following code. This may take some time (between 1 and 5 minutes to install all of the packages), so you do not need to worry if it does not finish immediately.

# install packages
install.packages("xlsx")
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("flextable")
install.packages("openxlsx")
install.packages("here")
install.packages("faux")
install.packages("data.tree")
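# install klippy for copy-to-clipboard button in code chunks
# (install_github comes from the remotes package, so remotes is installed first)
install.packages("remotes")
remotes::install_github("rlesur/klippy")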

Now that we have installed the packages, we can activate them as shown below.

# load packages
library(dplyr)
library(stringr)
library(tidyr)
library(flextable)
library(xlsx)
library(openxlsx)
library(here)
library(data.tree)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.


NOTE

As most of the time we load data from our own computers, this tutorial also assumes that you load from your own computer.

Specifically, we assume that you have a subfolder called data in the directory where you have your R project (the Rproj file) and that the data sets we load are located in that data subfolder. We also show you how you can load multiple text files into R (which is common if you work with corpora, for example). These text files are located in a folder called testcorpus within the data subfolder.

If you have a different set-up, you have to adapt the path to the data for the tutorial to work on your own computer. (Also note that the here function is used to create paths that start in the directory where the Rproj file is located.)


In other words, your directory should have the structure as shown below.

                       levelName
1  myproject                    
2   ¦--Rproj                    
3   ¦--load.Rmd                 
4   °--data                     
5       ¦--testdat.csv          
6       ¦--testdat2.csv         
7       ¦--testdat.xlsx         
8       ¦--testdat.txt          
9       ¦--testdat.rda          
10      ¦--english.rda          
11      °--testcorpus           
12          ¦--linguistics01.txt
13          ¦--linguistics02.txt
14          ¦--linguistics03.txt
15          ¦--linguistics04.txt
16          ¦--linguistics05.txt
17          ¦--linguistics06.txt
18          °--linguistics07.txt

The data used in this tutorial can be downloaded using the links below:

Tabulated data

There are several different functions that allow us to read comma-separated (csv) files, Excel files, and other tabular formats into R. After going over these, we will briefly look at how to create data from scratch, i.e. how to generate data rather than load it.

CSV

A common format for tabulated data is the comma-separated values (csv) file. To load such files, we can use the read.csv function as shown below.

datcsv <- read.csv(here::here("data", "testdat.csv"), header = TRUE)
# inspect
head(datcsv)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

The data is not spectacular and consists of a table with 2 columns (Variable1 and Variable2).

Sometimes, csv files are actually not comma-separated but use a semi-colon as a separator. In such cases, we can use the read.delim function to load the csv and specify that the separator (sep) is “;”.

# load csv with ;
datcsv2 <- read.delim(here::here("data", "testdat2.csv"), 
                   sep = ";", header = TRUE)
# inspect data
head(datcsv2)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

To save a data set as a csv file on your computer, we can use the write.csv function (here, the file is saved in the data folder within the folder where the Rproj file is located).

write.csv(datcsv, here::here("data", "testdat.csv"), row.names = FALSE)
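
The following is a minimal sketch of a full save-and-load round trip for both comma- and semicolon-separated files. It uses a small made-up data frame (toy) and temporary files instead of the tutorial's data folder, so these names are assumptions for illustration only. The base R functions write.csv2 and read.csv2 are the semicolon-separated counterparts of write.csv and read.csv.

```r
# Hypothetical toy data; the tutorial's testdat.csv is not needed here
toy <- data.frame(Variable1 = c(6, 65, 12), Variable2 = c(67, 16, 56))

# Use temporary files so nothing in the project folder is touched
comma_file <- tempfile(fileext = ".csv")
semi_file  <- tempfile(fileext = ".csv")

# Save the same data with two different separators
write.csv(toy, comma_file, row.names = FALSE)    # fields separated by ","
write.csv2(toy, semi_file, row.names = FALSE)    # fields separated by ";"

# Load both files again
from_comma <- read.csv(comma_file)
from_semi  <- read.csv2(semi_file)

# Check that both round trips reproduce the original values
all(from_comma == toy)
all(from_semi == toy)
```

Setting row.names = FALSE matters for the round trip: without it, the row numbers are written as an extra first column that shows up as an additional variable when the file is loaded again.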

XLSX

To load Excel data, you can use the read.xlsx function from the openxlsx package. We have activated the openxlsx package in the session preparation, so we do not need to activate it again here. If you get an error message telling you that R did not find the read.xlsx function, you need to activate the openxlsx package by running library(openxlsx).

# load data
datxlsx <- openxlsx::read.xlsx(here::here("data", "testdat.xlsx"), sheet = 1)
# inspect
head(datxlsx)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

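To save xlsx files, we can use the write.xlsx function from the openxlsx package as shown below. Because the xlsx package also provides a function called write.xlsx, we state the package explicitly.

openxlsx::write.xlsx(datxlsx, here::here("data", "testdat.xlsx"))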

TXT (tabulated)

If the data is tabular and stored as a txt file, there are various functions to read in the data. The most common functions are read.delim and read.table. The read.delim function is very flexible and allows you to specify the separator and to inform R whether the first row contains column headers rather than data points. For read.delim, header = TRUE is the default, so you only need to set header = FALSE if the data does not contain column headers; we state the argument explicitly below for clarity.

# load tab txt 1
dattxt <- read.delim(here::here("data", "testdat.txt"), 
                   sep = "\t", header = TRUE)
# inspect data
head(dattxt)

The read.table function is very similar and can also be used to load various types of tabulated data. Again, we let R know that the first row contains column headers rather than data points.

# load tab txt
dattxt2 <- read.table(here::here("data", "testdat.txt"), header = TRUE)
# inspect 
head(dattxt2)

To save tabulated txt files, we use the write.table function. In the write.table function, we define the separator (in this case we write a tab-separated file) and tell R not to add row names (i.e., R should not number the rows and store this information in a separate column).

# save txt
write.table(dattxt, here::here("data", "testdat.txt"), sep = "\t", row.names = FALSE)
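
To make the role of the header argument concrete, here is a small sketch that writes a tab-separated file to a temporary location and reads it back in three ways. The data frame toy and the file name are made up for illustration. Note that read.delim treats the first row as a header by default, whereas read.table does not when the first row has as many fields as the other rows.

```r
# Hypothetical toy data written to a temporary tab-separated file
toy <- data.frame(Variable1 = c(6, 65, 12), Variable2 = c(67, 16, 56))
tab_file <- tempfile(fileext = ".txt")
write.table(toy, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# read.delim: tab separator and header = TRUE are the defaults
with_delim <- read.delim(tab_file)

# read.table: header and separator stated explicitly
with_table <- read.table(tab_file, header = TRUE, sep = "\t")

# read.table without header = TRUE: the column names are read as a data row
no_header <- read.table(tab_file, sep = "\t")

# Compare the number of rows and the column names
nrow(with_delim)
nrow(no_header)
names(no_header)
```

If the number of rows is one higher than expected and the columns are called V1, V2, and so on, the header row has most likely been read as data.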

Unstructured data

TXT

Unstructured data (most commonly data representing raw text) is also very common - particularly when working with corpus data.

To load text data into R (here in the form of a txt file), we can use the scan function. Reading in texts using the scan function will result in a vector of strings where each string represents a separate word (more precisely, a sequence of characters delimited by whitespace).

testtxt <- scan(here::here("data", "english.txt"), what = "char")
# inspect
testtxt
##  [1] "Linguistics" "is"          "the"         "scientific"  "study"      
##  [6] "of"          "language"    "and"         "it"          "involves"   
## [11] "the"         "analysis"    "of"          "language"    "form,"      
## [16] "language"    "meaning,"    "and"         "language"    "in"         
## [21] "context."

In contrast, the readLines function will read in complete lines and result in a vector of strings representing lines (if the entire text is in 1 line, then the entire text will be loaded as a single string).

testtxt2 <- readLines(here::here("data", "english.txt"))
# inspect
testtxt2
## [1] "Linguistics is the scientific study of language and it involves the analysis of language form, language meaning, and language in context. "

To save text data, we can use the writeLines function as shown below.

writeLines(testtxt2, here::here("data", "english.txt"))
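
The difference between scan and readLines is easiest to see on a file with more than one line. The sketch below writes two made-up sentences to a temporary file (the file and object names are assumptions for illustration) and loads them with both functions.

```r
# Write a small two-line text to a temporary file
txt_file <- tempfile(fileext = ".txt")
writeLines(c("Linguistics is the scientific study of language.",
             "It involves the analysis of language form."),
           txt_file)

# scan splits the text at whitespace: one element per word
words <- scan(txt_file, what = "character", quiet = TRUE)

# readLines keeps the lines intact: one element per line
lines <- readLines(txt_file)

# Compare the lengths of the two results
length(words)
length(lines)

# Pasting the words back together recovers the text as a single string
paste(words, collapse = " ")
```

Which function is more convenient depends on the next step: scan is handy if you need word tokens straight away, while readLines preserves the line structure of the original file.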

Multiple TXTs

When working with text data, it is very common that we have to load multiple (or many) files containing texts. In this case, we first store the locations of the files in an object (here called fls) and then load the files in these locations using sapply (a looping function). In the sapply function, we can use either scan or readLines to read in the text. Below we use scan and then combine the individual elements into a text using the paste0 function. The output shows that we have successfully loaded 7 txt files from the testcorpus folder that is in the data folder.

# extract file paths
fls <- list.files(here::here("data", "testcorpus"), pattern = "\\.txt$", full.names = TRUE)
# load files
txts <- sapply(fls, function(x){
  x <- scan(x, what = "char") %>%
    paste0( collapse = " ")
  })
# inspect
str(txts)
##  Named chr [1:7] "Linguistics is the scientific study of language. It involves analysing language form language meaning and langu"| __truncated__ ...
##  - attr(*, "names")= chr [1:7] "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics01.txt" "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics02.txt" "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics03.txt" "D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics04.txt" ...

To save multiple txt files, we follow a similar procedure and first determine the paths that define where R will store the files and then loop over the files and store them in the testcorpus folder.

# define where to save each file
outs <- file.path(here::here("data", "testcorpus"), paste0("text", 1:7, ".txt"))
# save the files
lapply(seq_along(txts), function(i) 
       writeLines(txts[[i]],  
       con = outs[i]))
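
The save-then-load procedure for multiple files can be tried out without touching your project folder. The sketch below is one possible way to do this: it writes three made-up texts into a temporary folder and loads them again with readLines. The folder name toycorpus, the file names, and the texts are all assumptions for illustration.

```r
# Create a temporary folder that plays the role of the corpus folder
corpus_dir <- file.path(tempdir(), "toycorpus")
dir.create(corpus_dir, showWarnings = FALSE)

# Three made-up texts and one output path per text
texts <- c("First toy text.", "Second toy text.", "Third toy text.")
outs <- file.path(corpus_dir, sprintf("toy%02d.txt", seq_along(texts)))

# Save each text to its own file (invisible suppresses the printed NULLs)
invisible(Map(writeLines, texts, outs))

# Collect the paths of all txt files in the folder
fls <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# Load each file and collapse its lines into a single string
loaded <- vapply(fls, function(f) paste(readLines(f), collapse = " "),
                 character(1))

# Use the bare file names instead of the full paths as element names
names(loaded) <- basename(fls)
loaded
```

Compared to sapply, vapply additionally checks that every file yields exactly one string, which can help to catch empty or malformed files early.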

R data objects

When working with R in RStudio, it makes sense to save data as R data objects as this requires minimal storage space and allows you to load and save data very quickly. R data objects can have any format (structured, unstructured, lists, etc.). Here, we use the readRDS function to load R data objects (which can represent any form or type of data).

# load data
rdadat <- readRDS(here::here("data", "testdat.rda"))
# inspect
head(rdadat)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

To save R data objects, we use the saveRDS function as shown below.

saveRDS(rdadat, file = here::here("data", "testdat.rda"))
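
As a small illustration that R data objects can hold any kind of data, the sketch below stores a list containing a data frame and a string, then restores it. The object and file names are made up. The second half shows base R's save and load functions for comparison: files written with save have to be read with load (not readRDS), and load restores the object under its original name.

```r
# A made-up object that mixes a table and a text
obj <- list(table = data.frame(x = 1:3, y = c("a", "b", "c")),
            note  = "any R object works")

# saveRDS / readRDS: store one object, assign it to any name when loading
rds_file <- tempfile(fileext = ".rds")
saveRDS(obj, rds_file)
restored <- readRDS(rds_file)
identical(obj, restored)

# save / load: the object name is stored in the file as well
rda_file <- tempfile(fileext = ".rda")
save(obj, file = rda_file)
env <- new.env()              # load into a separate environment for clarity
load(rda_file, envir = env)
identical(env$obj, obj)
```

With readRDS you decide what the loaded object is called, which avoids accidentally overwriting an existing object of the same name.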

Web data

You can load all types of data discussed above from the web. The only thing you need to do is to change the path: instead of defining a path on your own computer, replace it with a URL wrapped in the url function, together with the additional argument "rb".

So loading the testdat.rda from the LADAL GitHub data repo would require the following path specification:

url("https://slcladal.github.io/data/testdat.rda", "rb")

compared to the data repo in the current Rproj:

here::here("data", "testdat.rda")

See below how you can load, e.g., an rda object from the LADAL data repo on GitHub.

webdat <- base::readRDS(url("https://slcladal.github.io/data/testdat.rda", "rb"))
# inspect
head(webdat)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

We can then store this data as shown in the sections above.

Generating data

In this section, we will briefly have a look at how to generate data in R.

Creating tabular data

To create a simple data frame, we can simply generate the columns and then bind them together using the data.frame function as shown below.

# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
l1 <- c("english", "german", "english")
mydat <- data.frame(age, gender, l1)
# inspect
head(mydat)
##   age gender      l1
## 1  25   male english
## 2  30 female  german
## 3  56   male english
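
Once a data frame exists, it is often useful to check the type of each column and to select rows. The sketch below repeats the toy data from above with explicitly named columns; the additional steps are only meant as an illustration.

```r
# Same toy data as above, with column names given explicitly
mydat <- data.frame(age = c(25, 30, 56),
                    gender = c("male", "female", "male"),
                    l1 = c("english", "german", "english"),
                    stringsAsFactors = FALSE)  # keep text columns as character

# Check the type of each column
sapply(mydat, class)

# Convert a text column into a factor if categories are needed
mydat$gender <- factor(mydat$gender)
levels(mydat$gender)

# Select only the rows of speakers older than 28
older <- mydat[mydat$age > 28, ]
older
```

In R versions from 4.0.0 onwards, stringsAsFactors = FALSE is the default, so stating it only makes the behaviour explicit.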

You can also generate more complex data sets where columns or variables correlate with each other. Below, we will generate a data set with 4 correlated variables: Proficiency (the proficiency of a speaker), Abroad (whether or not subjects have been abroad), University (if they went to a standard or excellent university), and PluralError (if they produced a number marking error in a test sentence).

We start by setting seed so the generated data will be the same each time we generate the data.

set.seed(678)

Next, we create a correlation matrix. Here, we will create 4 variables, and for each pair of variables we have to determine how strongly the two variables should be correlated with each other. The diagonal values are 1 as each variable correlates perfectly with itself.

cmat <- c(1.00,  0.05,  0.05, -0.5, 
          0.05,  1.00,  0.05, -0.3,
          0.05,  0.05,  1.00, -0.1,
         -0.50, -0.30, -0.10,  1.0)
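
Before generating data, it can be worth checking that the values actually form a valid correlation matrix. The following sketch uses only base R; the object name cmat_matrix and the particular checks are suggestions, not part of the original workflow. A valid correlation matrix is symmetric, has 1s on the diagonal, and has only positive eigenvalues.

```r
# The correlation values from above, entered row by row
cmat <- c(1.00,  0.05,  0.05, -0.5,
          0.05,  1.00,  0.05, -0.3,
          0.05,  0.05,  1.00, -0.1,
         -0.50, -0.30, -0.10,  1.0)

# Arrange the 16 values as a 4 x 4 matrix
cmat_matrix <- matrix(cmat, nrow = 4, byrow = TRUE)

# Check 1: the matrix mirrors itself along the diagonal
isSymmetric(cmat_matrix)

# Check 2: every variable correlates perfectly with itself
all(diag(cmat_matrix) == 1)

# Check 3: all eigenvalues are positive (the matrix is positive definite)
all(eigen(cmat_matrix, symmetric = TRUE)$values > 0)
```

If the third check fails, the requested correlations contradict each other and no data set can satisfy all of them at once.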

Next, we generate the data using the rnorm_multi function. In this function, we need to specify:

  • how many data points the data set should consist of (n)
  • the number of variables (vars)
  • the means (mu)
  • the standard deviations (sd)
  • the correlations (r; here we specify the correlation matrix we defined above)
  • the names of the variables (varnames).

If all variables should have the same mean, then we only need to provide a single value (but we need to provide 4 values if we want the variables to have different means).

dat <- faux::rnorm_multi(n = 400, vars = 4, mu = 1, sd = 1, r = cmat, 
                         varnames = c("Proficiency", "Abroad", "University", "PluralError"))
# inspect
head(dat)
##   Proficiency     Abroad University PluralError
## 1   1.2214098  1.2103764  1.1571743 -0.09080666
## 2   0.3439251  0.4244035  2.2630978  2.19437567
## 3   0.9070220  0.7618138  0.9030697  1.62879140
## 4   2.3410885 -0.5503251  2.0893112 -0.45122521
## 5   3.0568993  1.4600793  0.7689547 -1.05360914
## 6   2.7434079  1.5160429  1.2232027  1.51467370
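
To give an idea of what generating correlated variables involves, here is a base R sketch that uses a different, commonly used technique: multiplying independent normal values by the Cholesky factor of the correlation matrix. This is not necessarily how rnorm_multi works internally, and the object names (r, z, x, sim) are assumptions for illustration.

```r
set.seed(678)

# Target correlation matrix (same values as above)
r <- matrix(c(1.00,  0.05,  0.05, -0.5,
              0.05,  1.00,  0.05, -0.3,
              0.05,  0.05,  1.00, -0.1,
             -0.50, -0.30, -0.10,  1.0), nrow = 4)

# Step 1: draw independent standard normal values (400 rows, 4 columns)
n <- 400
z <- matrix(rnorm(n * 4), ncol = 4)

# Step 2: induce the correlations via the Cholesky factor of r
# Step 3: shift the means from 0 to 1 (the standard deviations stay at 1)
x <- z %*% chol(r) + 1

# Convert to a data frame with the variable names used in this tutorial
colnames(x) <- c("Proficiency", "Abroad", "University", "PluralError")
sim <- as.data.frame(x)

# The observed correlations should be close to, but not exactly, the targets
round(cor(sim), 2)
```

Because the data are random, the observed correlations deviate somewhat from the target values; the deviations shrink as n grows.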

If you want to generate numeric data, then this is all you need to do. If you want to generate categorical variables, however, you need to convert these numeric values into categories. In the example below, we convert all values higher than 1 (the mean) into one level, and all other values into a second level.

# modify data
dat <- dat %>%
  dplyr::mutate(Proficiency = ifelse(Proficiency > 1, "Advanced", "Intermediate"),
                Abroad = ifelse(Abroad > 1, "Abroad", "Home"),
                University = ifelse(University > 1, "Excellent", "Standard"),
                PluralError = ifelse(PluralError > 1, "Error", "Correct"))
# inspect
head(dat)
##    Proficiency Abroad University PluralError
## 1     Advanced Abroad  Excellent     Correct
## 2 Intermediate   Home  Excellent       Error
## 3 Intermediate   Home   Standard       Error
## 4     Advanced   Home  Excellent     Correct
## 5     Advanced Abroad   Standard     Correct
## 6     Advanced Abroad  Excellent       Error
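
Note that ifelse returns character vectors rather than factors. If a later analysis step requires factors, the result can be wrapped in factor, which also lets you fix the order of the levels. The sketch below illustrates this on a handful of made-up scores; the object names are assumptions.

```r
# Made-up numeric scores
scores <- c(0.2, 1.4, 0.9, 2.1, 1.0)

# Split at 1: values above 1 are "Advanced", all others "Intermediate"
level <- ifelse(scores > 1, "Advanced", "Intermediate")
class(level)

# Convert to a factor and set the order of the levels explicitly
level_factor <- factor(level, levels = c("Intermediate", "Advanced"))
levels(level_factor)

# Count how many observations fall into each level
table(level_factor)
```

Without the levels argument, factor orders the levels alphabetically, which would put "Advanced" first.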

And again, we could then save this data on our computer as shown in the sections above. For instance, we could save it as an MS Excel file as shown below.

write.xlsx(dat, here::here("data", "dat.xlsx"))

Creating text data

You may also want to create textual data (e.g., to create sample sentences or short test texts). Thus, we will briefly focus on how to create textual data in R.

The easiest way to generate text data is to simply create strings and combine them as shown below.

text <- c("This is an example sentence.", "This is a second example sentence")
# inspect
text
## [1] "This is an example sentence."      "This is a second example sentence"

If you need to generate many sentences that have a standard format, you can make use of the paste function.

num <- 1:4
start <- "This is sentence number "
end <- "."
texts <- paste(start, num, end, sep = "")
# inspect
texts
## [1] "This is sentence number 1." "This is sentence number 2."
## [3] "This is sentence number 3." "This is sentence number 4."

Or, you can combine these text snippets into a single string.

onetext <- paste(start, num, end, sep = "", collapse = " ")
# inspect
onetext
## [1] "This is sentence number 1. This is sentence number 2. This is sentence number 3. This is sentence number 4."
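
As an alternative to paste with sep = "", base R also offers paste0 and sprintf. The sketch below shows that all three produce the same sentences; which one to use is largely a matter of taste, and the object names are made up for illustration.

```r
num <- 1:4

# paste0 is shorthand for paste(..., sep = "")
with_paste0 <- paste0("This is sentence number ", num, ".")

# sprintf fills the placeholder %d with each (integer) number in turn
with_sprintf <- sprintf("This is sentence number %d.", num)

# Both approaches yield identical results
identical(with_paste0, with_sprintf)

# collapse joins the individual sentences into one string
onetext <- paste(with_sprintf, collapse = " ")
onetext
```

The sprintf variant keeps the sentence template in one piece, which can be easier to read when several values have to be inserted into the same string.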

The text can then be saved using the writeLines function as shown below.

writeLines(onetext, here::here("data", "onetext.txt"))

This is all for this tutorial. We hope it is useful and that you have a better idea about how to load, save and generate data now.

Citation & Session Info

Schweinberger, Martin. 2022. Loading, saving, and generating data in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/load.html (Version 2022.11.08).

@manual{schweinberger2022loadr,
  author = {Schweinberger, Martin},
  title = {Loading, saving, and generating data in R},
  note = {https://ladal.edu.au/load.html},
  year = {2022},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.11.08}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.tree_1.0.0 here_1.0.1      openxlsx_4.2.5  xlsx_0.6.5     
## [5] flextable_0.8.2 tidyr_1.2.0     stringr_1.4.1   dplyr_1.0.10   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2  xfun_0.32         bslib_0.4.0       purrr_0.3.4      
##  [5] rJava_1.0-6       colorspace_2.0-3  vctrs_0.4.1       generics_0.1.3   
##  [9] htmltools_0.5.3   yaml_2.3.5        base64enc_0.1-3   utf8_1.2.2       
## [13] rlang_1.0.4       jquerylib_0.1.4   pillar_1.8.1      glue_1.6.2       
## [17] DBI_1.1.3         gdtools_0.2.4     faux_1.1.0        uuid_1.1-0       
## [21] lifecycle_1.0.1   munsell_0.5.0     gtable_0.3.0      zip_2.2.0        
## [25] evaluate_0.16     knitr_1.40        fastmap_1.1.0     fansi_1.0.3      
## [29] xlsxjars_0.6.1    highr_0.9         Rcpp_1.0.9        scales_1.2.1     
## [33] cachem_1.0.6      jsonlite_1.8.0    systemfonts_1.0.4 ggplot2_3.3.6    
## [37] digest_0.6.29     stringi_1.7.8     grid_4.2.1        rprojroot_2.0.3  
## [41] cli_3.3.0         tools_4.2.1       magrittr_2.0.3    sass_0.4.2       
## [45] klippy_0.0.0.9500 tibble_3.1.8      pkgconfig_2.0.3   data.table_1.14.2
## [49] xml2_1.3.3        assertthat_0.2.1  rmarkdown_2.16    officer_0.4.4    
## [53] rstudioapi_0.14   R6_2.5.1          compiler_4.2.1