Introduction

The data we work with comes in many formats and types. This tutorial therefore shows how to load and save different types of data in R and takes a brief look at how to generate data in R.

This tutorial is aimed at beginners. The aim is not to provide a fully-fledged analysis but rather to show and exemplify how to load and save the most common types of data and data structures in R.

To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the following R Basics tutorials:


Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below run without errors. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code. It may take some time (between 1 and 5 minutes) to install all of the packages, so you do not need to worry if it takes a while.

# install packages
install.packages("xlsx")
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")
install.packages("flextable")
install.packages("openxlsx")
install.packages("here")
install.packages("faux")
install.packages("data.tree")
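# install klippy for copy-to-clipboard button in code chunks
# (klippy is installed from GitHub, which requires the remotes package)
install.packages("remotes")
remotes::install_github("rlesur/klippy")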
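
If some of these packages are already on your computer, installing all of them again is unnecessary. The following is a minimal sketch of how to find out which packages are still missing. The helper name missing_pkgs is our own (hypothetical) and not part of any package, and the actual installation line is commented out so that nothing is installed by accident.

```r
# packages used in this tutorial
pkgs <- c("xlsx", "dplyr", "stringr", "tidyr", "flextable",
          "openxlsx", "here", "faux", "data.tree")
# hypothetical helper: return the packages that are not installed yet
missing_pkgs <- function(p) {
  setdiff(p, rownames(installed.packages()))
}
# list what is still missing
to_install <- missing_pkgs(pkgs)
to_install
# install only the missing packages (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```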

Now that we have installed the packages, we can activate them as shown below.

# load packages
library(dplyr)
library(stringr)
library(tidyr)
library(flextable)
library(xlsx)
library(openxlsx)
library(here)
library(data.tree)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.


NOTE

This tutorial assumes that you will be loading data from your own computer, as is often the case.

This tutorial assumes that you have a designated subfolder named data within the directory where your R project (the Rproj file) is located. It is assumed that your data sets are stored in this data subfolder. Additionally, we provide guidance on how to load multiple text files into R, a common scenario when working with corpora. These texts are expected to be located in a folder named testcorpus within the data subfolder.

If you have a different setup, you will need to adjust the path to the data in order for the tutorial to function correctly on your own computer. Note that the here function (from the here package) creates paths that start from the directory where the Rproj file is located.


In other words, your directory should have the structure as shown below.

myproject
 ¦--Rproj
 ¦--load.Rmd
 °--data
     ¦--testdat.csv
     ¦--testdat2.csv
     ¦--testdat.xlsx
     ¦--testdat.txt
     ¦--testdat.rda
     ¦--english.txt
     °--testcorpus
         ¦--linguistics01.txt
         ¦--linguistics02.txt
         ¦--linguistics03.txt
         ¦--linguistics04.txt
         ¦--linguistics05.txt
         ¦--linguistics06.txt
         °--linguistics07.txt
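
Before loading anything, it can help to check whether the expected files and folders actually exist. The sketch below illustrates such a check with base R only. It uses a temporary folder as a stand-in for the project folder and creates a single placeholder file, so the folder and file created here are assumptions for illustration and not your actual data.

```r
# stand-in for the project folder (a temporary directory, for illustration only)
proj <- file.path(tempdir(), "myproject")
dir.create(file.path(proj, "data", "testcorpus"),
           recursive = TRUE, showWarnings = FALSE)
# create one empty placeholder file so that the check has something to find
file.create(file.path(proj, "data", "testdat.csv"))
# files and folders the tutorial expects, relative to the project folder
expected <- c("data/testdat.csv", "data/testdat2.csv", "data/testcorpus")
# check which of them exist
found <- file.exists(file.path(proj, expected))
names(found) <- expected
found
```

In your own project, you would replace proj with the path returned by here::here().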

The data used in this tutorial can be downloaded using the links below:

Tabulated data

In R, there are multiple functions available for reading comma-separated (csv) files, Excel spreadsheets, and other tabular files. After covering these functions, we will briefly explore how to generate data from scratch, without loading pre-existing data files.

CSV

A common file type when working with tabulated data is the comma-separated values (csv) file. To load such files, we can use the read.csv function as shown below.

# load csv file
datcsv <- read.csv(here::here("data", "testdat.csv"),
                   # indicate the data has column names
                   header=TRUE) 
# inspect first 6 rows using the head() function
head(datcsv)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

The data is not spectacular and consists of a table with 2 columns (Variable1 and Variable2).

Sometimes, csv files are actually not comma-separated but use a semi-colon as a separator. In such cases, we can use the read.delim function to load the csv and specify that the separator (sep) is “;”.

# load csv with ; as the separator
datcsv2 <- read.delim(here::here("data", "testdat2.csv"),
                      # define separator
                      sep = ";", 
                      # indicate that the data has column names
                      header = TRUE)
# inspect data
head(datcsv2)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42
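
To see why the separator matters, the following sketch reads the same two rows once with commas and once with semicolons. It uses the text argument of read.csv so that no file is needed; the example values are made up for illustration.

```r
# the same two rows, once comma-separated and once semicolon-separated
comma_txt <- "Variable1,Variable2\n6,67\n65,16"
semi_txt  <- "Variable1;Variable2\n6;67\n65;16"
# read.csv expects commas by default
d1 <- read.csv(text = comma_txt)
# for semicolons, state the separator explicitly
d2 <- read.csv(text = semi_txt, sep = ";")
# reading the semicolon data without sep yields a single column
d3 <- read.csv(text = semi_txt)
# number of columns in each result
c(ncol(d1), ncol(d2), ncol(d3))
```

If a freshly loaded table has only one column, a wrong separator is a likely cause.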

To save a data set as a csv on your computer, we can use the write.csv function (here, the file is saved within the data folder, which is located in the folder that contains the Rproj file).

# save data as a csv without row names
write.csv(datcsv, here::here("data", "testdat.csv"), row.names = FALSE)
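
The effect of the row.names argument becomes visible when we read a saved file back in. The sketch below writes a small, made-up data frame to two temporary files (so nothing in your project is overwritten) and compares the column names.

```r
# small example data frame (made up for illustration)
d <- data.frame(Variable1 = c(6, 65, 12), Variable2 = c(67, 16, 56))
# temporary files so nothing in your project is overwritten
f_with <- tempfile(fileext = ".csv")
f_without <- tempfile(fileext = ".csv")
# default: row names are written as an extra, unnamed first column
write.csv(d, f_with)
# row.names = FALSE: only the actual columns are written
write.csv(d, f_without, row.names = FALSE)
# compare the column names after reading the files back in
names(read.csv(f_with))
names(read.csv(f_without))
```

With the default setting, the row numbers come back as an additional column (named X by read.csv), which is usually not wanted.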

XLSX

To load Excel data, you can use the read.xlsx function from the openxlsx package. We have activated the openxlsx package in the session preparation so we do not need to activate it again here. If you get an error message telling you that R did not find the read.xlsx function, you need to activate the openxlsx package by running library(openxlsx). Because the xlsx package also provides a function called read.xlsx, the code below uses the prefix openxlsx:: to make explicit which package is meant.

# load data
datxlsx <- openxlsx::read.xlsx(
  # define path where data is stored
  here::here("data", "testdat.xlsx"),
  # define spreadsheet to load
  sheet = 1)
# inspect first 6 rows using the head() function
head(datxlsx)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

To save xlsx files, we can use the write.xlsx function from the openxlsx package as shown below.

openxlsx::write.xlsx(
  # define object to be stored
  datxlsx, 
  # define path where data should be stored
  here::here("data", "testdat.xlsx"))
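
Excel files often contain more than one spreadsheet. The following sketch shows one way to write and read several sheets with openxlsx. The two tables and the sheet names first and second are made up for illustration, the file is a temporary file, and the code only runs if openxlsx is installed.

```r
# run only if openxlsx is installed
if (requireNamespace("openxlsx", quietly = TRUE)) {
  # two small example tables (made up for illustration)
  sheets <- list(first = data.frame(Variable1 = c(6, 65), Variable2 = c(67, 16)),
                 second = data.frame(Variable1 = c(12, 56), Variable2 = c(56, 34)))
  # temporary file so nothing in your project is overwritten
  f <- tempfile(fileext = ".xlsx")
  # a named list is written as one sheet per list element
  openxlsx::write.xlsx(sheets, f)
  # list the sheet names
  print(openxlsx::getSheetNames(f))
  # load the second sheet by name instead of by position
  second <- openxlsx::read.xlsx(f, sheet = "second")
  print(second)
}
```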

TXT (tabulated)

If the data is tabular and stored as a txt file, there are various functions to read in the data. The most common functions are read.delim and read.table. The read.delim function is very flexible and allows you to specify the separator and to inform R that the first row contains column headers rather than data points. For read.delim, a tab separator and header = TRUE are the defaults, so both arguments are spelled out below only for clarity; if the data does not contain column headers, you need to specify header = FALSE.

# load tab txt 1
dattxt <- read.delim(here::here("data", "testdat.txt"), 
                   sep = "\t", header = TRUE)
# inspect data
head(dattxt)

The read.table function is very similar and can also be used to load various types of tabulated data. Again, we let R know that the first row contains column headers rather than data points.

# load tab txt
dattxt2 <- read.table(here::here("data", "testdat.txt"), header = TRUE)
# inspect 
head(dattxt2)
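
The two functions differ in their defaults, which is easy to overlook. The sketch below reads the same small tab-separated table (made up for illustration and passed in via the text argument, so no file is needed) with both functions.

```r
# a small tab-separated table as text (made up for illustration)
tab_txt <- "Variable1\tVariable2\n6\t67\n65\t16"
# read.delim: tab separator and header = TRUE are the defaults
a <- read.delim(text = tab_txt)
# read.table: without header = TRUE the first row is treated as data here
b <- read.table(text = tab_txt)
# read.table with header = TRUE gives the same table as read.delim
c2 <- read.table(text = tab_txt, header = TRUE)
# compare dimensions and column names
dim(a)
dim(b)
names(b)
all.equal(a, c2)
```

Without header = TRUE, read.table keeps the header row as a data row and assigns generic column names (V1, V2), so the table has one row too many.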

To save tabulated txt files, we use the write.table function. In the write.table function, we define the separator (in this case we write a tab-separated file) and tell R not to add row names (i.e., R should not number the rows and store this information in a separate column).

# save txt
write.table(dattxt, here::here("data", "testdat.txt"), sep = "\t", row.names = FALSE)

Unstructured data

TXT

Unstructured data (most commonly data representing raw text) is also very common - particularly when working with corpus data.

To load text data into R (here in the form of a txt file), we can use the scan function. Reading in a text with the scan function results in a vector of strings in which each string represents a separate word (more precisely, a whitespace-separated token, so punctuation stays attached to the word it follows).

testtxt <- scan(here::here("data", "english.txt"), what = "char")
# inspect
testtxt
##  [1] "Linguistics" "is"          "the"         "scientific"  "study"      
##  [6] "of"          "language"    "and"         "it"          "involves"   
## [11] "the"         "analysis"    "of"          "language"    "form,"      
## [16] "language"    "meaning,"    "and"         "language"    "in"         
## [21] "context."

In contrast, the readLines function will read in complete lines and result in a vector of strings representing lines (if the entire text is in 1 line, then the entire text will be loaded as a single string).

testtxt2 <- readLines(here::here("data", "english.txt"))
# inspect
testtxt2
## [1] "Linguistics is the scientific study of language and it involves the analysis of language form, language meaning, and language in context. "
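
The difference between the two functions is easiest to see with a text that has more than one line. The sketch below writes a made-up two-line text to a temporary file and loads it with both functions.

```r
# write a two-line example text to a temporary file
f <- tempfile(fileext = ".txt")
writeLines(c("Linguistics is the scientific study of language.",
             "It involves the analysis of form and meaning."), f)
# scan: one element per whitespace-separated token
words <- scan(f, what = "character", quiet = TRUE)
# readLines: one element per line
txt_lines <- readLines(f)
# number of elements in each result
length(words)
length(txt_lines)
```

Which function is preferable depends on what you want to do next: scan is convenient for word-based work, while readLines keeps the line structure of the file.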

To save text data, we can use the writeLines function as shown below.

writeLines(testtxt2, here::here("data", "english.txt"))

Multiple TXTs

When dealing with text data, we often need to load multiple files containing texts. In such cases, we typically begin by storing the file paths in an object (called fls here) and then load the files using the sapply function, which loops over the paths. Within the sapply function, we can use either scan or readLines to read the text. In the example below, we use scan and then merge the individual elements into a single text using the paste0 function. The output shows that the 7 txt files from the testcorpus folder within the data folder were loaded successfully.

# extract file paths
fls <- list.files(here::here("data", "testcorpus"), pattern = "\\.txt$", full.names = TRUE)
# load files
txts <- sapply(fls, function(x){
  x <- scan(x, what = "char") %>%
    paste0( collapse = " ")
  })
# inspect
str(txts)
##  Named chr [1:7] "Linguistics is the scientific study of language. It involves analysing language form language meaning and langu"| __truncated__ ...
##  - attr(*, "names")= chr [1:7] "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics01.txt" "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics02.txt" "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics03.txt" "F:/data recovery/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/testcorpus/linguistics04.txt" ...

To save multiple txt files, we follow a similar procedure: we first define the paths where R will store the files and then loop over the texts and store them in the testcorpus folder.

# define where to save each file
outs <- file.path(here::here("data", "testcorpus"), paste0("text", seq_along(txts), ".txt"))
# save the files (invisible() suppresses the list of NULLs that lapply returns)
invisible(lapply(seq_along(txts), function(i) 
       writeLines(txts[[i]],  
       con = outs[i])))
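
As noted above, readLines can be used instead of scan. The following sketch shows this variant on a small stand-in corpus that is created in a temporary folder; the folder name testcorpus_demo and the three texts are made up for illustration. It also replaces the long file paths with short file names as element names.

```r
# create a temporary stand-in corpus with three short texts
corpus_dir <- file.path(tempdir(), "testcorpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
demo_texts <- c("First text.", "Second text,\nwith two lines.", "Third text.")
for (i in seq_along(demo_texts)) {
  writeLines(demo_texts[i], file.path(corpus_dir, paste0("text", i, ".txt")))
}
# collect the paths of all files ending in .txt
demo_fls <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)
# read each file and collapse its lines into a single string
demo_txts <- vapply(demo_fls,
                    function(f) paste(readLines(f), collapse = " "),
                    character(1))
# use the file names without path and extension as element names
names(demo_txts) <- tools::file_path_sans_ext(basename(demo_fls))
demo_txts
```

Using vapply with character(1) instead of sapply makes sure that every file yields exactly one string.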

R data objects

When working with R in RStudio, it makes sense to save data as R data objects as this requires minimal storage space and allows you to load and save data very quickly. R data objects can have any format (structured, unstructured, lists, etc.). Here, we use the readRDS function to load R data objects (which can represent any form or type of data). Note that the example files carry the extension .rda but were saved with saveRDS, so they are read with readRDS rather than with load.

# load data
rdadat <- readRDS(here::here("data", "testdat.rda"))
# inspect
head(rdadat)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

To save R data objects, we use the saveRDS function as shown below.

saveRDS(rdadat, file = here::here("data", "testdat.rda"))
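
A quick way to convince yourself that nothing gets lost is a round trip: save an object, read it back in, and compare the two. The sketch below does this with a small, made-up data frame and a temporary file.

```r
# small example object (made up for illustration)
d <- data.frame(Variable1 = c(6, 65, 12), Variable2 = c(67, 16, 56))
# temporary file so nothing in your project is overwritten
f <- tempfile(fileext = ".rds")
# save the object and read it back in under a new name
saveRDS(d, file = f)
d_copy <- readRDS(f)
# compare the copy with the original, including column types
identical(d, d_copy)
```

Because readRDS returns the object rather than restoring it under its old name, you are free to choose the name under which the data is stored in your session.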

Web data

You can load all types of data discussed above from the web. The only thing you need to do is to change the path: instead of defining a path on your own computer, wrap the URL in the url function and add the additional argument "rb" (which opens the connection for reading in binary mode).

So loading the testdat.rda from the LADAL github data repo would require the following path specification:

url("https://slcladal.github.io/data/testdat.rda", "rb")

compared to the data repo in the current Rproj:

here::here("data", "testdat.rda")

See below how you can load, e.g., an rda object from the LADAL data repo on GitHub.

webdat <- base::readRDS(url("https://slcladal.github.io/data/testdat.rda", "rb"))
# inspect
head(webdat)
##   Variable1 Variable2
## 1         6        67
## 2        65        16
## 3        12        56
## 4        56        34
## 5        45        54
## 6        84        42

We can then store this data as shown in the sections above.

Generating data

In this section, we will briefly have a look at how to generate data in R.

Creating tabular data

To create a simple data frame, we can simply generate the columns and then bind them together using the data.frame function as shown below.

# create a data frame from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
l1 <- c("english", "german", "english")
mydat <- data.frame(age, gender, l1)
# inspect
head(mydat)
##   age gender      l1
## 1  25   male english
## 2  30 female  german
## 3  56   male english

You can also generate more complex data sets where columns or variables correlate with each other. Below, we will generate a data set with 4 correlated variables: Proficiency (the proficiency of a speaker), Abroad (whether or not subjects have been abroad), University (if they went to a standard or excellent university), and PluralError (if they produced a number marking error in a test sentence).

We start by setting a seed so that the generated data will be the same each time we run the code.

set.seed(678)

Next, we create a correlation matrix. Here, we will create 4 variables, and for each pair of variables we have to determine how strongly they should be correlated. The diagonal values are 1 as each variable correlates perfectly with itself. The values are entered as a vector of 16 numbers, laid out row by row.

cmat <- c(1.00,  0.05,  0.05, -0.5, 
          0.05,  1.00,  0.05, -0.3,
          0.05,  0.05,  1.00, -0.1,
         -0.50, -0.30, -0.10,  1.0)
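
Since cmat is stored as a plain vector, it can be useful to arrange it as a labelled matrix and check that it is a usable correlation matrix before generating data. The sketch below does this with base R; the object name cmat_m is our own choice for illustration.

```r
# correlation values from above, row by row
cmat <- c(1.00,  0.05,  0.05, -0.5,
          0.05,  1.00,  0.05, -0.3,
          0.05,  0.05,  1.00, -0.1,
         -0.50, -0.30, -0.10,  1.0)
vars <- c("Proficiency", "Abroad", "University", "PluralError")
# arrange the 16 values as a labelled 4 x 4 matrix
cmat_m <- matrix(cmat, nrow = 4, byrow = TRUE, dimnames = list(vars, vars))
cmat_m
# the matrix should be symmetric and have 1s on the diagonal ...
isSymmetric(cmat_m)
all(diag(cmat_m) == 1)
# ... and all of its eigenvalues should be positive (positive definite)
all(eigen(cmat_m, symmetric = TRUE)$values > 0)
```

If the last check fails, the chosen correlations contradict each other and no data set can have exactly these correlations.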

Next, we generate the data using the rnorm_multi function. In this function, we need to specify:

  • how many data points the data set should consist of (n)
  • the number of variables (vars)
  • the means (mu)
  • the standard deviations (sd)
  • the correlations (r; here we specify the correlation matrix we defined above)
  • the names of the variables (varnames).

If all variables should have the same mean, we only need to provide a single value; if we want the variables to have different means, we need to provide 4 values (the same holds for the standard deviations).

dat <- faux::rnorm_multi(n = 400, vars = 4, mu = 1, sd = 1, r = cmat, 
                         varnames = c("Proficiency", "Abroad", "University", "PluralError"))
# inspect
head(dat)
##   Proficiency     Abroad University PluralError
## 1   1.2214098  1.2103764  1.1571743 -0.09080666
## 2   0.3439251  0.4244035  2.2630978  2.19437567
## 3   0.9070220  0.7618138  0.9030697  1.62879140
## 4   2.3410885 -0.5503251  2.0893112 -0.45122521
## 5   3.0568993  1.4600793  0.7689547 -1.05360914
## 6   2.7434079  1.5160429  1.2232027  1.51467370
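
To get an intuition for how correlated data of this kind can be generated, the sketch below produces comparable data with base R only. Please note that this is not how the faux package is implemented; it is an alternative technique (multiplying independent normal values by the Cholesky factor of the correlation matrix), shown here under the assumption that all variables have a mean of 1 and a standard deviation of 1. We use a large n, which is our own choice, so that the observed correlations are close to the targets.

```r
set.seed(678)
# target correlations (same values as cmat above)
R <- matrix(c(1.00,  0.05,  0.05, -0.5,
              0.05,  1.00,  0.05, -0.3,
              0.05,  0.05,  1.00, -0.1,
             -0.50, -0.30, -0.10,  1.0), nrow = 4, byrow = TRUE)
n <- 10000
# independent standard normal values, one column per variable
Z <- matrix(rnorm(n * 4), nrow = n, ncol = 4)
# multiplying by the Cholesky factor induces the target correlations;
# adding 1 shifts every variable to a mean of 1
X <- Z %*% chol(R) + 1
colnames(X) <- c("Proficiency", "Abroad", "University", "PluralError")
# observed correlations and means should be close to the targets
round(cor(X), 2)
round(colMeans(X), 2)
```

The same check, cor() applied to the generated data, can be used on the output of rnorm_multi before the values are converted into categories.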

If you want to generate numeric data, this is all you need to do. If you want to generate categorical variables, however, you need to convert these numeric values into category labels. In the example below, we convert all values higher than 1 (the mean) into one category and all other values into a second category. The resulting columns are character vectors, which can be turned into factors with the factor function if needed.

# modify data
dat <- dat %>%
  dplyr::mutate(Proficiency = ifelse(Proficiency > 1, "Advanced", "Intermediate"),
                Abroad = ifelse(Abroad > 1, "Abroad", "Home"),
                University = ifelse(University > 1, "Excellent", "Standard"),
                PluralError = ifelse(PluralError > 1, "Error", "Correct"))
# inspect
head(dat)
##    Proficiency Abroad University PluralError
## 1     Advanced Abroad  Excellent     Correct
## 2 Intermediate   Home  Excellent       Error
## 3 Intermediate   Home   Standard       Error
## 4     Advanced   Home  Excellent     Correct
## 5     Advanced Abroad   Standard     Correct
## 6     Advanced Abroad  Excellent       Error
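
The columns created with ifelse are character vectors. If you need proper factors, for example to control the order of the levels, a minimal sketch could look as follows; the numeric values are made up for illustration.

```r
# a few numeric example values (made up for illustration)
prof <- c(1.22, 0.34, 0.91, 2.34, 3.06, 2.74)
# ifelse returns character labels ...
prof_chr <- ifelse(prof > 1, "Advanced", "Intermediate")
# ... which factor() turns into a factor with an explicit level order
prof_fct <- factor(prof_chr, levels = c("Intermediate", "Advanced"))
class(prof_chr)
levels(prof_fct)
# frequency of each level
table(prof_fct)
```

Setting the levels explicitly avoids the default alphabetical order, in which Advanced would come before Intermediate.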

And again, we could then save this data on our computer as shown in the sections above. For instance, we could save it as an MS Excel file as shown below.

openxlsx::write.xlsx(dat, here::here("data", "dat.xlsx"))

Creating text data

You may also want to create textual data (e.g., to create sample sentences or short test texts). Thus, we will briefly focus on how to create textual data in R.

The easiest way to generate text data is to simply create strings and combine them as shown below.

text <- c("This is an example sentence.", "This is a second example sentence")
# inspect
text
## [1] "This is an example sentence."      "This is a second example sentence"

If you need to generate many sentences that have a standard format, you can make use of the paste function.

num <- 1:4
start <- "This is sentence number "
end <- "."
texts <- paste(start, num, end, sep = "")
# inspect
texts
## [1] "This is sentence number 1." "This is sentence number 2."
## [3] "This is sentence number 3." "This is sentence number 4."

Or, you can combine these text snippets into a single string.

onetext <- paste(start, num, end, sep = "", collapse = " ")
# inspect
onetext
## [1] "This is sentence number 1. This is sentence number 2. This is sentence number 3. This is sentence number 4."
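
As an alternative to paste, the base R function sprintf fills values into a template string, which can be easier to read when a sentence has several variable parts. The sketch below produces the same four sentences as above.

```r
num <- 1:4
# %d is replaced by each (integer) number in turn
texts_sprintf <- sprintf("This is sentence number %d.", num)
# same result as pasting the three parts together
texts_paste <- paste("This is sentence number ", num, ".", sep = "")
identical(texts_sprintf, texts_paste)
```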

The text can then be saved using the writeLines function as shown below.

writeLines(onetext, here::here("data", "onetext.txt"))

This is all for this tutorial. We hope it is useful and that you have a better idea about how to load, save and generate data now.

Citation & Session Info

Schweinberger, Martin. 2022. Loading, saving, and generating data in R. Brisbane: The University of Queensland. URL: https://ladal.edu.au/load.html (Version 2022.11.08).

@manual{schweinberger2022loadr,
  author = {Schweinberger, Martin},
  title = {Loading, saving, and generating data in R},
  note = {https://ladal.edu.au/load.html},
  year = {2022},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.11.08}
}
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.tree_1.0.0  here_1.0.1       openxlsx_4.2.5.2 xlsx_0.6.5      
## [5] flextable_0.9.1  tidyr_1.3.0      stringr_1.5.0    dplyr_1.1.2     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10             xlsxjars_0.6.1          assertthat_0.2.1       
##  [4] rprojroot_2.0.3         digest_0.6.31           utf8_1.2.3             
##  [7] mime_0.12               R6_2.5.1                evaluate_0.21          
## [10] ggplot2_3.4.2           highr_0.10              pillar_1.9.0           
## [13] gdtools_0.3.3           rlang_1.1.1             curl_5.0.0             
## [16] uuid_1.1-0              rstudioapi_0.14         data.table_1.14.8      
## [19] jquerylib_0.1.4         klippy_0.0.0.9500       rmarkdown_2.21         
## [22] textshaping_0.3.6       munsell_0.5.0           shiny_1.7.4            
## [25] compiler_4.2.2          httpuv_1.6.11           xfun_0.39              
## [28] pkgconfig_2.0.3         askpass_1.1             systemfonts_1.0.4      
## [31] gfonts_0.2.0            htmltools_0.5.5         faux_1.2.1             
## [34] openssl_2.0.6           tidyselect_1.2.0        tibble_3.2.1           
## [37] fontBitstreamVera_0.1.1 httpcode_0.3.0          fansi_1.0.4            
## [40] crayon_1.5.2            later_1.3.1             crul_1.4.0             
## [43] grid_4.2.2              gtable_0.3.3            jsonlite_1.8.4         
## [46] xtable_1.8-4            lifecycle_1.0.3         magrittr_2.0.3         
## [49] scales_1.2.1            zip_2.3.0               cli_3.6.1              
## [52] stringi_1.7.12          cachem_1.0.8            promises_1.2.0.1       
## [55] xml2_1.3.4              bslib_0.4.2             ellipsis_0.3.2         
## [58] ragg_1.2.5              generics_0.1.3          vctrs_0.6.2            
## [61] tools_4.2.2             glue_1.6.2              officer_0.6.2          
## [64] fontquiver_0.2.1        purrr_1.0.1             fastmap_1.1.1          
## [67] yaml_2.3.7              colorspace_2.1-0        fontLiberation_0.1.0   
## [70] rJava_1.0-6             knitr_1.43              sass_0.4.6



