This tutorial shows how to work with tables and how to tabulate data in R. A more advanced and detailed but also truly excellent and highly recommendable resource on processing data in R is Wickham and Grolemund (2016; see also Wickham et al. 2019). Alternative but also very useful resources are Stander and Dalla Valle (2017) and Estrellado et al. (2020).
This tutorial is aimed at beginners and intermediate users of R with the aim of showcasing how to work with and process tabulated data (tables) using R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful functions and methods associated with tabular data.
To be able to follow this tutorial, we suggest you check out and
familiarize yourself with the content of the following R
Basics tutorials:
Click here1 to
download the entire R Notebook for this
tutorial.
Click
here
to open an interactive Jupyter notebook that allows you to execute,
change, and edit the code as well as to upload your own data.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the packages so you do not need to worry if it takes some time).
# install packages
install.packages("xlsx")
install.packages("tidyverse")
install.packages("flextable")
install.packages("openxlsx")
install.packages("here")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
Now that we have installed the packages, we can activate them as shown below.
# load packages
library(dplyr)
library(tidyr)
library(flextable)
library(xlsx)
library(openxlsx)
library(here)
# activate klippy for copy-to-clipboard button
klippy::klippy()
Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.
Tables are one of the most common and important ways to process, handle, and describe data. This tutorial introduces different types of tables, how to load and save different types of tables, as well as how to modify and process tables and tabulated data.
When working with R, there are different kinds or types of tables that have different characteristics. The most common types of tables in R are:
Matrices can only contain one type of data and all data points will be converted to the type of scale with the lowest information value. For instance, if at least one variables in a table represents characters (or strings), then all other variables are also converted into characters (although they may be numbers).
Data frames can contain different types of scales and variables are not converted automatically. In this sense, data frames are more flexible and are preferable over matrices.
Tibbles are the tidyverse equivalent of data frames which offer new functions and possibilities of handling and inspecting the data. .
There are several different functions that allow us to read tabulated
data into R. In our case, we use the readRDS
function which
loads Rdata sets.
# load data with read.delim
mytable <- base::readRDS(url("https://slcladal.github.io/data/mld.rda", "rb"))
status | attraction | money |
---|---|---|
Relationship | NotInterested | 86.33 |
Relationship | NotInterested | 45.58 |
Relationship | NotInterested | 68.43 |
Relationship | NotInterested | 52.93 |
Relationship | NotInterested | 61.86 |
Relationship | NotInterested | 48.47 |
Relationship | NotInterested | 32.79 |
Relationship | NotInterested | 35.91 |
Relationship | NotInterested | 30.98 |
Relationship | NotInterested | 44.82 |
If the data is stored as a txt-file, there are various functions to
read in the data. The most common functions are read.delim
and read.table
.
# load data with read.delim
tab1 <- read.delim("https://slcladal.github.io/data/mlrdata.txt",
sep = "\t", header = TRUE)
tab2 <- read.table("https://slcladal.github.io/data/mlrdata.txt", header = TRUE)
To load excel data from the web, you can use the
read.xlsx
function from the openxlsx
package.
# load data
exceldata <- openxlsx::read.xlsx("https://slcladal.github.io/data/testdata1.xlsx",
sheet = 1)
Variable1 | Variable2 |
---|---|
6 | 67 |
65 | 16 |
12 | 56 |
56 | 34 |
45 | 54 |
84 | 42 |
38 | 36 |
46 | 47 |
64 | 54 |
24 | 29 |
To load excel data from your own computer, you can use the
read_excel
function from the readxl
package
(which is part of the tidyverse and thus does not need to be installed
separately.
NOTE
You would need to adapt the path to the data (which
currently points to the data folder on my computer!). Thus, the path
below is specified to my computer.
`
`
# load data
excelcomp <- readxl::read_excel(here::here("data", "testdata1.xlsx"), sheet = 1)
Variable1 | Variable2 |
---|---|
6 | 67 |
65 | 16 |
12 | 56 |
56 | 34 |
45 | 54 |
84 | 42 |
38 | 36 |
46 | 47 |
64 | 54 |
24 | 29 |
The most common functions that are used to inspect tabular data are
the head()
and the str()
functions. The
head()
function shows the first 6 lines (by default) but we
can also specify the number of rows. The str()
function
provides a summary of the structure of the data.
Inspecting the first 6 rows of a table.
head(mytable)
## status attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47
Inspecting the first 10 rows of a table.
head(mytable, 10)
## status attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47
## 7 Relationship NotInterested 32.79
## 8 Relationship NotInterested 35.91
## 9 Relationship NotInterested 30.98
## 10 Relationship NotInterested 44.82
Checking the structure of tabulated data.
str(mytable)
## 'data.frame': 100 obs. of 3 variables:
## $ status : chr "Relationship" "Relationship" "Relationship" "Relationship" ...
## $ attraction: chr "NotInterested" "NotInterested" "NotInterested" "NotInterested" ...
## $ money : num 86.3 45.6 68.4 52.9 61.9 ...
The following section shows how to access and manipulate tables.
The tidyverse is a specific way of writing R code that builds on a family of libraries designed for efficient data science work flows which were developed initially by Hadley Wickham. This new way of writing R code builds on a shared and underlying design philosophy and grammar. Due to its popularity and ease of use, the tidyverse way to write R code is becoming increasingly popular and we will use it in the following to handle and manipulate tabulated data.
If you have already loaded data into R and now want to process the
data, you typically have to modify the data in some form or another to
get the information or format you need. The tidyverse offers very
user-friendly, intuitive, and handy functions for processing the data to
match the needs of your analysis. To have access to the tidyverse
functions for data processing, we load the tidyverse
package and load and inspect another set of data using the
read.delim
function. The new data is stored as a txt file
and has 100 observations (rows) and 3 variables (status,
attraction, and money). The data set represents how
much money people have spend in someone they were interested in or not
(attraction: Interested versus NotInterested) and
their own relationship status (status: Single versus
Relationship).
# load new data
newdata <- base::readRDS(url("https://slcladal.github.io/data/mld.rda", "rb"))
status | attraction | money |
---|---|---|
Relationship | NotInterested | 86.33 |
Relationship | NotInterested | 45.58 |
Relationship | NotInterested | 68.43 |
Relationship | NotInterested | 52.93 |
Relationship | NotInterested | 61.86 |
Relationship | NotInterested | 48.47 |
Relationship | NotInterested | 32.79 |
Relationship | NotInterested | 35.91 |
Relationship | NotInterested | 30.98 |
Relationship | NotInterested | 44.82 |
The table represents 3 variables (status, attraction, and money) and each row contains information on the relationship status of 100 people and how much money these people would spend on a gift to someone of the opposite sex who they are or are not interested in. We will now check out different ways and functions to process this data.
Piping, done with the sequence %>%
, is a very easy,
intuitive, quick, and handy way to process data. Essentially piping
means that we take an element that is to the left of the piping symbol
and then do something to it; that way, the piping symbol can be
translated as and then.
We could, for example, load data and then capitalize the column names and then group the data by status and attraction and then get the mean of money spend on deleting all observations except for the first one. A more formal way to write this would be:
load %>% capitalize %>% group %>% summarize.
In R this command would look like this:
# example of a data processing pipeline
pipeddata <- base::readRDS(url("https://slcladal.github.io/data/mld.rda", "rb")) %>%
dplyr::rename(Status = status, Attraction = attraction, Money = money) %>%
dplyr::group_by(Status, Attraction) %>%
dplyr::summarise(Mean = mean(Money))
# inspect summarized data
pipeddata
## # A tibble: 4 × 3
## # Groups: Status [2]
## Status Attraction Mean
## <chr> <chr> <dbl>
## 1 Relationship Interested 99.2
## 2 Relationship NotInterested 51.5
## 3 Single Interested 157.
## 4 Single NotInterested 46.0
The pipe has worked and we get the resulting summary which shows the mean of the money spend based on Attraction and Status.
Among the most frequent procedures in data processing is selecting
certain columns or extracting rows based on variable levels. In the
tidyverse, this is done by using the select
and
filter
functions. While select
allows to
extract columns, filter
is used to extract rows, e.g. to
get only observations that have a certain feature. Have a look at the
example below.
# select and filter
reduceddata <- newdata %>%
# select the columns attraction and money
dplyr::select(attraction, money) %>%
# extract rows which represent cases where the person was interested in someone
dplyr::filter(attraction == "Interested")
# inspect new table
nrow(reduceddata); table(reduceddata$attraction)
## [1] 50
##
## Interested
## 50
We have now reduced the data by excluding status (we have only
selected attraction and money) and we have removed those 50 data rows of
people who were not interested. The select
function (like
most other tidyverse functions) can also be used together with a minus
sign which causes a column to be removed, thus
dplyr::select(-money)
would remove the money column (see
below).
# select and filter
datawithoutmoney <- newdata %>%
# remove money
dplyr::select(-money)
# inspect data
head(datawithoutmoney)
## status attraction
## 1 Relationship NotInterested
## 2 Relationship NotInterested
## 3 Relationship NotInterested
## 4 Relationship NotInterested
## 5 Relationship NotInterested
## 6 Relationship NotInterested
Selecting and filtering are extremely powerful functions that can also be combined with other functions. But before we discuss more complex issues, we will have a look at how we can change columns.
Changing and adding data is done with the mutate
function. The mutate
functions requires that we specify a
column name - if we use the same name as the column we are changing,
then we change the column but if we specify another column name, then a
new column is created.
We will now create a new column (Spendalot) in which we encode if the person has spend a lot of money (100 AUD or more) on the present or not (less than 100 AUD).
# creating a new column
newdata <- newdata %>%
dplyr::mutate(Spendalot = ifelse(money >= 100, "Alot", "Alittle"))
# inspect data
head(newdata)
## status attraction money Spendalot
## 1 Relationship NotInterested 86.33 Alittle
## 2 Relationship NotInterested 45.58 Alittle
## 3 Relationship NotInterested 68.43 Alittle
## 4 Relationship NotInterested 52.93 Alittle
## 5 Relationship NotInterested 61.86 Alittle
## 6 Relationship NotInterested 48.47 Alittle
The table now has a new column (Spendalot) because we have specified
a column name that did not exist yet - had we written
dplyr::mutate(money = ifelse(money >= 100, "Alot", "Alittle"))
then we would have changed the money column and replaced the money
values with the labels Alot and Alittle.
Oftentimes, column names are not really meaningful or incoherent which makes it easier to wrap your head around what the values in a column refer to. The easiest way around this is rename columns which is, fortunately very simple in the tidyverse. While the column names of our example table are meaningful, I want to capitalize the first letter of each column name. This can be done as follows.
# renaming columns
newdata <- newdata %>%
dplyr::rename(Status = status, Attraction = attraction, Money = money)
# inspect data
head(newdata)
## Status Attraction Money Spendalot
## 1 Relationship NotInterested 86.33 Alittle
## 2 Relationship NotInterested 45.58 Alittle
## 3 Relationship NotInterested 68.43 Alittle
## 4 Relationship NotInterested 52.93 Alittle
## 5 Relationship NotInterested 61.86 Alittle
## 6 Relationship NotInterested 48.47 Alittle
The renaming was successful as all column names now begin with a capital letter.
In contrast to mutate, which retains the number of rows, summarizing creates new columns but collapses rows and only provides the summary value (or values if more than one summary is specified). Also, columns that are not grouping variables are removed.
Summarizing is particularly useful when we want to get summaries of groups. We will modify the example from above and extract the mean and the standard deviation of the money spend on presents by relationship status and whether the giver was attracted to the giv-ee.
#grouping and summarizing data
datasummary <- newdata %>%
dplyr::group_by(Status, Attraction) %>%
dplyr::summarise(Mean = round(mean(Money), 2), SD = round(sd(Money), 1))
# inspect summarized data
datasummary
## # A tibble: 4 × 4
## # Groups: Status [2]
## Status Attraction Mean SD
## <chr> <chr> <dbl> <dbl>
## 1 Relationship Interested 99.2 14.7
## 2 Relationship NotInterested 51.5 17
## 3 Single Interested 157. 23.2
## 4 Single NotInterested 46.0 19.9
One very common problem is that data - or at least parts of it - have
to be transformed from long to wide format or vice versa. In the
tidyverse, this is done using the gather
and
spread
function. We will convert the summary table shown
above into a wide format (we also remove the SD column as it is no
longer needed)
# converting data to wide format
widedata <- datasummary %>%
# remove SD column
dplyr::select(-SD) %>%
# convert into wide format
tidyr::spread(Attraction, Mean)
# inspect wide data
widedata
## # A tibble: 2 × 3
## # Groups: Status [2]
## Status Interested NotInterested
## <chr> <dbl> <dbl>
## 1 Relationship 99.2 51.5
## 2 Single 157. 46.0
We can re-convert the wide into a long format using the
gather
function.
# converting data to long format
longdata <- widedata %>%
# convert into long format
tidyr::gather(Attraction, Money, Interested:NotInterested)
# inspect wide data
longdata
## # A tibble: 4 × 3
## # Groups: Status [2]
## Status Attraction Money
## <chr> <chr> <dbl>
## 1 Relationship Interested 99.2
## 2 Single Interested 157.
## 3 Relationship NotInterested 51.5
## 4 Single NotInterested 46.0
There are many more useful functions for processing, handling, and summarizing tables but this should suffice to get you started.
Schweinberger, Martin. 2022. Handling tables in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/table.html (Version 2022.11.17).
@manual{schweinberger2022tabr,
author = {Schweinberger, Martin},
title = {Handling tables in R},
note = {https://ladal.edu.au/table.html},
year = {2022},
organization = "The University of Queensland, School of Languages and Cultures},
address = {Brisbane},
edition = {2022.11.17}
}
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] here_1.0.1 openxlsx_4.2.5.2 xlsx_0.6.5 flextable_0.9.1
## [5] tidyr_1.3.0 dplyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 xlsxjars_0.6.1 assertthat_0.2.1
## [4] rprojroot_2.0.3 digest_0.6.31 utf8_1.2.3
## [7] mime_0.12 cellranger_1.1.0 R6_2.5.1
## [10] evaluate_0.21 highr_0.10 pillar_1.9.0
## [13] gdtools_0.3.3 rlang_1.1.1 readxl_1.4.2
## [16] curl_5.0.0 uuid_1.1-0 rstudioapi_0.14
## [19] data.table_1.14.8 jquerylib_0.1.4 klippy_0.0.0.9500
## [22] rmarkdown_2.21 textshaping_0.3.6 shiny_1.7.4
## [25] compiler_4.2.2 httpuv_1.6.11 xfun_0.39
## [28] pkgconfig_2.0.3 askpass_1.1 systemfonts_1.0.4
## [31] gfonts_0.2.0 htmltools_0.5.5 openssl_2.0.6
## [34] tidyselect_1.2.0 tibble_3.2.1 fontBitstreamVera_0.1.1
## [37] httpcode_0.3.0 fansi_1.0.4 withr_2.5.0
## [40] crayon_1.5.2 later_1.3.1 crul_1.4.0
## [43] grid_4.2.2 jsonlite_1.8.4 xtable_1.8-4
## [46] lifecycle_1.0.3 magrittr_2.0.3 zip_2.3.0
## [49] cli_3.6.1 stringi_1.7.12 cachem_1.0.8
## [52] promises_1.2.0.1 xml2_1.3.4 bslib_0.4.2
## [55] ellipsis_0.3.2 ragg_1.2.5 generics_0.1.3
## [58] vctrs_0.6.2 tools_4.2.2 glue_1.6.2
## [61] officer_0.6.2 fontquiver_0.2.1 purrr_1.0.1
## [64] fastmap_1.1.1 yaml_2.3.7 fontLiberation_0.1.0
## [67] rJava_1.0-6 knitr_1.43 sass_0.4.6
If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.↩︎