Introduction

This tutorial introduces dimension reduction methods in R, with the aim of showcasing how these methods work, how to prepare the data, and how to implement selected dimension reduction methods (Principal Component Analysis, Factor Analysis, and Multidimensional Scaling). Unfortunately, we cannot cover other, more complex dimension reduction methods such as Uniform Manifold Approximation and Projection (UMAP) or t-Distributed Stochastic Neighbor Embedding (t-SNE) in this tutorial (but we plan on adding a separate tutorial for these methods in the future). Reducing multiple variables to a smaller set is useful for many purposes, e.g., for visualization, noise reduction, and simplifying complex data sets. However, it is important to note that the interpretability of the principal components may not always be straightforward, as they are linear combinations of the original features.

This tutorial is aimed at beginners and intermediate users of R. The aim is not to provide a fully-fledged guide but rather to show and exemplify some common dimension reduction methods with R.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rproj file.

Here is a link to an interactive version of this tutorial on Binder. The interactive tutorial is based on a Jupyter notebook of this tutorial. This interactive Jupyter notebook allows you to execute code yourself and - if you copy the Jupyter notebook - you can also change and edit the notebook, e.g. you can change code and upload your own data.


To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the following R Basics tutorials:

Click here1 to download the entire R Notebook for this tutorial.

Binder
Click here to open an interactive Jupyter notebook that allows you to execute, change, and edit the code as well as to upload your own data.


Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries), so you do not need to worry if it takes a while.

# install packages
install.packages("dplyr")
install.packages("stringr")
install.packages("writexl")
install.packages("here")
install.packages("flextable")
install.packages("tidyr")
install.packages("MASS")
install.packages("factoextra")
install.packages("ggplot2")
install.packages("report")
install.packages("psych")
install.packages("tufte")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Now that we have installed the packages, we fetch them from the library to activate them as shown below.

options(scipen = 999) # suppress scientific notation
library(dplyr)
library(here)
library(flextable)
library(stringr)
library(tidyr)
library(MASS)
library(factoextra)
library(ggplot2)
library(report)
library(psych)
library(tufte)

What are dimension reduction methods?

Dimension reduction methods such as Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and Factor Analysis are techniques used to reduce the number of variables or dimensions in a data set while preserving as much relevant information as possible.

The choice of method depends on the specific goals of the analysis and the nature of the data being analyzed. Each method addresses different aspects of dimension reduction:

  • PCA emphasizes variance

  • MDS emphasizes pairwise distances

  • Factor Analysis emphasizes the underlying latent factors.

Below are brief explanations of the three commonly used dimension reduction methods.

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical technique that transforms a data set into a new coordinate system where the variables (features) are linearly uncorrelated.

It aims to find a set of orthogonal axes (principal components) in such a way that the first principal component accounts for the maximum variance in the data, the second principal component for the second maximum variance, and so on. By selecting a subset of these principal components, you can achieve dimension reduction while retaining the most significant information in the data.

Unfortunately, PCA only works really well when the number of features (or variables) isn’t too big and when the majority of the variance is explained by 2 or 3 principal components. If you are dealing with a very large data set with many features, UMAP is a better alternative.

Multidimensional Scaling (MDS)

Multidimensional Scaling is a method used to represent high-dimensional data in a lower-dimensional space (usually 2D or 3D) while maintaining pairwise distances or similarities between data points. MDS attempts to preserve the relationships among data points as much as possible. It’s often used in visualization to represent complex data in a way that makes it easier to interpret or analyze.

Like PCA, MDS only works really well when the number of variables isn’t too big and when the majority of the variance is explained by 2 or 3 dimensions.

Factor Analysis

Factor Analysis is a statistical technique that aims to uncover underlying latent factors that contribute to the observed variables in a data set. It’s particularly useful when dealing with data where variables may be correlated, and you want to identify common factors that explain the shared variance. Factor Analysis assumes that observed variables are influenced by both the common factors and unique factors specific to each variable.

Principal Component Analysis

Principal Component Analysis (PCA) (see, e.g., Clark and Ma’ayan 2011) is used for dimensionality reduction and data compression while preserving as much variance as possible in the data. It achieves this by transforming the original data into a new coordinate system defined by a set of orthogonal axes called principal components. The first principal component captures the maximum variance in the data, the second captures the second maximum variance, and so on.

“Imagine you have just opened a cider shop. You have 50 varieties of cider and you want to work out how to allocate them onto shelves, so that similar-tasting ciders are put on the same shelf. There are lots of different tastes and textures in cider - sweetness, tartness, bitterness, yeastiness, fruitiness, clarity, fizziness etc etc. So what you need to do to put the bottles into categories is answer two questions:

  1. What qualities are most important for identifying groups of ciders? e.g. does classifying based on sweetness make it easier to cluster your ciders into similar-tasting groups than classifying based on fruitiness?

  2. Can we reduce our list of variables by combining some of them? e.g. is there actually a variable that is some combination of “yeastiness and clarity and fizziness” and which makes a really good scale for classifying varieties?

This is essentially what PCA does. Principal components are variables that usefully explain variation in a data set - in this case, that usefully differentiate between groups. Each principal component is one of your original explanatory variables, or a combination of some of your original explanatory variables.”
Freya Harrison

Let’s delve into an easy example to get a better understanding of the theoretical underpinnings of PCA and to see how it works (this section is based on Starmer 2018). A very brief, nice, and easy-to-understand explanation of PCA is available here (Amoeba 2022).

Primer: Why should I use a PCA and what does PCA tell us?

Imagine we have a data set representing measurements of 4 features across a sample of 6 languages. We now want to see if the languages form groups based on the frequencies of these features.

Language   Feat1   Feat2   Feat3   Feat4
Lang1         10     6.0    12.0       5
Lang2         11     4.0     9.0       7
Lang3          8     5.0    10.0       6
Lang4          3     3.0     2.5       2
Lang5          1     2.8     1.3       4
Lang6          2     1.0     2.0       7
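The plotting and PCA code in this primer assumes that this toy data set is stored in a data frame called pcadat (the name used by the code further below). A minimal sketch for creating it from the values in the table above:

# create the toy data set used in this primer
pcadat <- data.frame(Language = paste0("Lang", 1:6),
                     Feat1 = c(10, 11, 8, 3, 1, 2),
                     Feat2 = c(6.0, 4.0, 5.0, 3.0, 2.8, 1.0),
                     Feat3 = c(12.0, 9.0, 10.0, 2.5, 1.3, 2.0),
                     Feat4 = c(5, 7, 6, 2, 4, 7))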

To understand how PCA works, let us start by plotting the first feature (Feat1).
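The figures in this primer are not generated by code shown here; one way to produce a plot of the first feature (a sketch, assuming the pcadat data frame created above) is:

# plot the values of Feat1 along a single axis
ggplot(pcadat, aes(x = Feat1, y = 0, label = Language)) +
  geom_point(size = 3) +
  geom_text(vjust = -1.5) +
  theme_bw() +
  labs(y = "")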

When we check the visualization of the first feature, we see that there appear to be two groups in our data!

We now add the second feature (Feat2) to check if the second feature adds information and supports our hypothesis that there are two types of languages in our data based on the features we collected.

While adding the second feature has not added much information (in that sense, Feat2 is not as distinctive), there still appear to be two groups. We continue by adding the third feature (Feat3), using point size to represent this third dimension.

Feat3 supports the two group hypothesis as small dots have values below 5 and big dots have values above 5 on the first feature scale. Finally, we add the fourth feature (Feat4) and use color to visualize this 4th dimension.

Like the second feature, Feat4 has not added much information.

Now, what would we use a PCA for in this context?

PCA can tell us if and how many groups there are in our data and it can tell us which features are responsible for the division into groups!

Let’s now go through a PCA step-by-step. Also, we only consider Feat1 and Feat2 to keep things very simple.

Language   Feat1   Feat2
Lang1         10     6.0
Lang2         11     4.0
Lang3          8     5.0
Lang4          3     3.0
Lang5          1     2.8
Lang6          2     1.0

We start by calculating the center of the data.

Next we scale and center the data so that the center is at (0,0).

The relationship between data points is unchanged!

Important: always center and scale your data as the scales will impact the magnitude of variance! In other words, if you do not center and scale, components can be deemed important simply because the scale of the variables is bigger (not because they are more important)!
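A minimal sketch of centering and scaling in R, applied to the two features of the toy data:

# center (subtract the mean) and scale (divide by the standard deviation)
feat_scaled <- scale(pcadat[, c("Feat1", "Feat2")], center = TRUE, scale = TRUE)
# the column means of the scaled data are now (practically) 0
colMeans(feat_scaled)
# and the standard deviations are 1
apply(feat_scaled, 2, sd)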

Now, we fit a line through the data.

IMPORTANT!

But how do we arrive at this line, and why is there always talk of “orthogonal” distances when talking about PCA?

Let’s have a look at the simple Ordinary Least Squares (OLS) procedure that underlies regression:

In OLS, the lines that represent variance (or residuals) are perpendicular to the x-axis.

However, in PCA, the lines are perpendicular to the fitted line itself, as shown below!

The line that minimizes these orthogonal distances is called Principal Component 1 (PC1) and it is different from a regression line. PC1 can be found by minimizing the orthogonal distances (blue lines above) between the data points and their projections (green points above)!

An alternative way to find PC1 is actually not to try and minimize the distance between the data point and its projection but to maximize the distance between the projection (the green points on the line) and the origin (the center of the coordinate system). We can then square the distances between projections and the origin and add these squared distances, which gives us the sum of squared distances or SS(distances).

PC1 has a slope of 0.9411239 - this already shows us that the data is spread much more along PC1 than along PC2!

We go about 1.0625593 units (1 / 0.9411) to the right and 1 unit up.

Now, we find the value for a (the square root of b^2 + c^2) = 1.4591204.

Next, we scale b and c so that a is 1.

a^2 = b^2 + c^2

b = 1.0625593 / 1.4591204 = 0.7282191

c = 1 / 1.4591204 = 0.6853444

To make PC1, we need 0.728 of Feat1 and 0.685 of Feat2.

This 1 unit long vector (a) is called the Eigenvector for PC1.

Eigenvectors and eigenvalues are not the same! An eigenvector represents the loadings of each observed variable on a specific component (or factor), while an eigenvalue represents the amount of variance explained by an entire component (or factor).

The proportions (0.728 and 0.685) are called the loading scores.
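We can reproduce these numbers with a few lines of R (based on the slope reported above):

slope <- 0.9411239            # slope of PC1
b <- 1 / slope                # 1.0625593 units to the right
c <- 1                        # 1 unit up
a <- sqrt(b^2 + c^2)          # length of the vector: 1.4591204
b / a                         # loading of Feat1 on PC1: 0.7282191
c / a                         # loading of Feat2 on PC1: 0.6853444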

PC2 is simply the perpendicular line to PC1 that goes through 0,0.

As PC1 and PC2 are perpendicular, the Eigenvector of PC2 is -0.685 of Feat1 and 0.728 of Feat2.

We can (but do not have to) rotate PC1 and make the points smaller (so that our plot looks more like a proper PCA plot).

To plot the results of a PCA, we simply rotate the plot so that the PCs reflect the axes and we project the points onto the PCs.

# generate data
df <- data.frame(pcadat$Feat1,
                 pcadat$Feat2)
colnames(df) <- c("Feat1", "Feat2")
# perform PCA
pcaex <- prcomp(df, center = T, scale = T)
# visualize PCA results
fviz_pca_ind(pcaex, label =  "") +
  coord_cartesian(xlim = c(-2, 2), ylim = c(-2, 2)) +
  ggtitle("PCA plot") +
  theme(aspect.ratio = 1)

Now we have an understanding of how we arrive at the visual representation of a PCA. But how do we arrive at the explained variance?

In the context of PCA, variance refers to summative variance or overall/total variability.

To get a better grasp of what this means, let’s consider a covariance matrix of some 3 variables (rather than just 2 variables as above). Their variances are on the diagonal, and the sum of these 3 values (3.448) is the overall variability.

            V1           V2           V3
V1   1.3437305   -0.1601523    0.1864702
V2  -0.1601523    0.6192056   -0.1266843
V3   0.1864702   -0.1266843    1.4855496

Now, PCA replaces original variables with new variables, called principal components, which are orthogonal (i.e. they have zero covariation!) and have variances (called eigenvalues) in decreasing order. So, the covariance matrix between the principal components extracted from the above data is this:

           PC1        PC2        PC3
PC1   1.651354   0.000000  0.0000000
PC2   0.000000   1.220288  0.0000000
PC3   0.000000   0.000000  0.5768431

Note that the diagonal sum is still 3.448, which says that all 3 components account for all the multivariate variability. The PC1 accounts for or “explains” 1.651/3.448 = 47.9% of the overall variability; the PC2 explains 1.220/3.448 = 35.4% of it; the 3rd one explains .577/3.448 = 16.7% of it.

So, what do they mean when they say that PCA explains maximal variance? That is not, of course, that it finds the largest variance among the three values 1.3437305, 0.6192056, and 1.4855496. PCA finds, in the data space, the dimension (direction) with the largest variance out of the overall variance 1.3437305 + 0.6192056 + 1.4855496 = 3.448. That largest variance would be 1.651354. Then it finds the dimension with the second largest variance, orthogonal to the first one, out of the remaining 3.448 - 1.651354 of the overall variance. That second dimension would have a variance of 1.220288. And so on. The last remaining dimension has a variance of 0.576843.
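We can confirm this with the eigen() function: the eigenvalues of a covariance matrix are exactly the variances of the principal components. A minimal sketch using the matrix shown above:

# covariance matrix of the three variables shown above
covmat <- matrix(c( 1.3437305, -0.1601523,  0.1864702,
                   -0.1601523,  0.6192056, -0.1266843,
                    0.1864702, -0.1266843,  1.4855496),
                 nrow = 3, byrow = TRUE)
# eigenvalues = variances of the principal components (1.651, 1.220, 0.577)
eigen(covmat)$values
# proportion of the overall variability explained by each component
eigen(covmat)$values / sum(eigen(covmat)$values)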

For our example, we can extract the amount of variance accounted for as shown below:

Step 1: get loading scores

pcaex$x
##             PC1        PC2
## [1,] -1.6229306  0.2698980
## [2,] -0.9855078 -0.6922526
## [3,] -0.8983094  0.1947325
## [4,]  0.7132969  0.2067653
## [5,]  1.1180034  0.4515144
## [6,]  1.6754475 -0.4306576

Step 2: calculate standard deviations of PC1 and PC2

sdpc1 <- sd(pcaex$x[,1])
sdpc2 <- sd(pcaex$x[,2])
sdpc1; sdpc2
## [1] 1.339995
## [1] 0.45212

Step 3: use standard deviations to calculate the amount of variance explained by each component

# Proportion of Variance PC1 = sd PC1 squared / (sd PC1 squared + sd PC2 squared)
sdpc1^2 / (sdpc1^2 + sdpc2^2) 
## [1] 0.8977938
# Proportion of Variance PC2 = sd PC2 squared / (sd PC1 squared + sd PC2 squared)
sdpc2^2 / (sdpc1^2 + sdpc2^2) 
## [1] 0.1022062

According to our manual calculation, PC1 explains 89.78% of the variation while PC2 explains 10.22% of the variation. To check if this is correct, we check our results against the results provided by the prcomp function.
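The output below is what summary() reports for the prcomp object:

# compare with the variance proportions reported by prcomp
summary(pcaex)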

## Importance of components:
##                           PC1    PC2
## Standard deviation     1.3400 0.4521
## Proportion of Variance 0.8978 0.1022
## Cumulative Proportion  0.8978 1.0000

But how do we arrive at the standard deviations of the PCs? The squared standard deviations (i.e. the variances) of the PCs are the diagonal values of the covariance matrix of the rotated data! So let’s have a look at the covariance matrices of the raw data, the scaled data, and the rotated, final data.
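The three matrices shown below can be generated from the df and pcaex objects created earlier (a minimal sketch):

# covariance matrix of the raw data
cov(df)
# covariance matrix of the centered and scaled data (= the correlation matrix)
cov(scale(df))
# covariance matrix of the rotated data (the principal components)
cov(pcaex$x)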

##           Feat1    Feat2
## Feat1 18.966667 6.126667
## Feat2  6.126667 3.126667
##           Feat1     Feat2
## Feat1 1.0000000 0.7955875
## Feat2 0.7955875 1.0000000
##                          PC1                      PC2
## PC1 1.7955875066756534774726 0.0000000000000003981407
## PC2 0.0000000000000003981407 0.2044124933243466890609

We see that the squared standard deviation (i.e. the variance) of PC1 is equal to the sum of the first row (or column) of the scaled covariance matrix. But more importantly, we see that the squared standard deviations of the PCs are equal to the variances (eigenvalues) of the PCs in the covariance matrix of the rotated data.

sdpc1^2 # sd of PC1 squared
## [1] 1.795588
# which is equal to
sum(cov(pcaex$x)[1,])
## [1] 1.795588
sdpc2^2 # sd of PC2 squared
## [1] 0.2044125
# which is equal to
sum(cov(pcaex$x)[2,])
## [1] 0.2044125

Now that we have worked through one example of a PCA manually, we turn to several examples to show how one can implement PCA in R to showcase what PCA can be used for.

Example 1

Here, we will show

  • how to use the prcomp() function to perform PCA

  • how to draw a PCA plot using ggplot2

  • how to determine how much variation each component accounts for

  • how to examine the loading scores to determine what variables have the largest effect on the graph

In this example, the data is in a matrix called pcadat where columns are individual samples (i.e. languages) and rows are measurements taken for all the samples (i.e. features).

pcadat <- matrix(nrow=100, ncol=10)
colnames(pcadat) <- c(
  paste("indoeu", 1:5, sep=""), # Indo-European languages
  paste("austro", 1:5, sep="")) # Austronesian languages
rownames(pcadat) <- paste("feature", 1:100, sep="")
for (i in 1:100) {
  indoeu.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  austro.values <- rpois(5, lambda=sample(x=10:1000, size=1))
 
  pcadat[i,] <- c(indoeu.values, austro.values)
}
# inspect data
head(pcadat); dim(pcadat)
##          indoeu1 indoeu2 indoeu3 indoeu4 indoeu5 austro1 austro2 austro3
## feature1    1058     996     985    1020     958     759     745     820
## feature2     601     639     628     666     591     657     702     662
## feature3     792     831     736     762     770     319     346     340
## feature4     321     363     340     371     351     102      97      96
## feature5     152     141     134     149     121     422     420     409
## feature6     380     407     406     409     417     871     930     895
##          austro4 austro5
## feature1     766     797
## feature2     625     692
## feature3     352     339
## feature4     127      95
## feature5     423     417
## feature6     887     817
## [1] 100  10

We now implement the PCA using the prcomp function from the stats package. In our case, we specify that we want to transpose the data (using the t function) because we need features to represent columns (not rows!) and we set the argument scale to TRUE as the data has to be normalised for a PCA to provide reliable results.

pca <- prcomp(t(pcadat), scale=TRUE) 

Now, we generate a scree plot to show how much variance is accounted for by each component. We start by preparing the data.

# prepare data
pca.var <- pca$sdev^2
# extract percentage of contribution
pca.var.per <- round(pca.var/sum(pca.var)*100, 1)
# generate data frame for visualization
pcascreedat <- data.frame(Component = paste0("PC", 1:10),
                          Percent = round(pca.var/sum(pca.var)*100, 2)) %>%
  dplyr::mutate(Text = paste0(Percent, "%"),
                Component = factor(Component, 
                                    levels = c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10")))
# inspect
pcascreedat
##    Component Percent   Text
## 1        PC1   92.48 92.48%
## 2        PC2    2.29  2.29%
## 3        PC3    1.43  1.43%
## 4        PC4    1.07  1.07%
## 5        PC5    0.79  0.79%
## 6        PC6    0.61  0.61%
## 7        PC7    0.57  0.57%
## 8        PC8    0.51  0.51%
## 9        PC9    0.24  0.24%
## 10      PC10    0.00     0%

Now that the data is formatted appropriately, we generate the scree plot.

ggplot(pcascreedat, aes(x = Component, y = Percent, label = Text)) +
  geom_bar(stat = "identity") +
  geom_text(vjust=-1.6, size = 3) +
  ggtitle("Scree Plot") +
  labs(x="Principal Components", y="Percent Variation") +
  theme_bw() +
  coord_cartesian(ylim = c(0, 100))

The scree plot shows that PC1 accounts for 92.48% of the variability while PC2 only accounts for 2.29% of the variability. This means that one component suffices to differentiate the data.

Now we make a fancy looking plot that shows the PCs and the variation.

data.frame(Sample=rownames(pca$x),
                       X=pca$x[,1],
                       Y=pca$x[,2]) %>%
  ggplot(aes(x=X, y=Y, label=Sample)) +
  geom_text() +
  xlab(paste("PC1 (", pca.var.per[1], "%)", sep="")) +
  ylab(paste("PC2 (", pca.var.per[2], "%)", sep="")) +
  theme_bw() +
  ggtitle("PCA Graph") +
  coord_cartesian(xlim = c(-11, 11))

Next, we get the names of the 5 features that contribute most to PC1.

loading_scores <- pca$rotation[,1]
# extract the magnitudes of the scores
feature_scores <- abs(loading_scores) 
# extract the 5 highest scores
feature_score_ranked <- sort(feature_scores, decreasing=TRUE)
top_5_features <- names(feature_score_ranked[1:5])
# show the names of the top 5 features 
top_5_features
## [1] "feature24" "feature62" "feature58" "feature44" "feature29"

Now, we show the scores (and +/- sign)

pca$rotation[top_5_features,1]
##  feature24  feature62  feature58  feature44  feature29 
##  0.1039671 -0.1039448 -0.1039315  0.1039010 -0.1039010

Example 2

In this example, we explore the use of PCA for survey data. Specifically, we want to use PCA to check if 5 items reflect the underlying factor appropriately or if one of the items needs to be replaced.

surveydata <- base::readRDS(url("https://slcladal.github.io/data/sud.rda", "rb"))
# inspect
flextable::flextable(head(surveydata, 10))

Respondent      Q01_Outgoing  Q02_Outgoing  Q03_Outgoing  Q04_Outgoing  Q05_Outgoing  Q06_Intelligence  Q07_Intelligence
Respondent_01              4             5             4             4             5                 2                 3
Respondent_02              5             4             5             4             4                 2                 2
Respondent_03              5             4             4             5             5                 2                 1
Respondent_04              5             5             5             4             5                 1                 1
Respondent_05              4             5             4             5             5                 2                 2
Respondent_06              5             5             5             5             4                 5                 4
Respondent_07              4             5             4             5             5                 4                 5
Respondent_08              4             4             5             4             5                 5                 4
Respondent_09              5             5             4             4             4                 5                 5
Respondent_10              4             5             5             4             4                 4                 4

Respondent      Q08_Intelligence  Q09_Intelligence  Q10_Intelligence  Q11_Attitude  Q12_Attitude  Q13_Attitude  Q14_Attitude  Q15_Attitude
Respondent_01                  3                 2                 2             3             3             2             3             3
Respondent_02                  2                 1                 2             4             4             4             5             4
Respondent_03                  1                 2                 2             5             4             4             4             4
Respondent_04                  1                 1                 1             5             4             5             5             5
Respondent_05                  1                 2                 1             4             5             4             5             5
Respondent_06                  5                 2                 2             1             2             1             2             1
Respondent_07                  4                 4                 5             2             1             1             2             1
Respondent_08                  5                 4                 5             1             2             1             1             2
Respondent_09                  4                 4                 5             1             2             2             1             2
Respondent_10                  5                 4                 5             2             1             2             2             1

We now implement the PCA.

# extract PCs from the five outgoingness items (princomp uses the covariance matrix by default)
pca_res2 <- princomp(surveydata[c("Q01_Outgoing", 
                                  "Q02_Outgoing",  
                                  "Q03_Outgoing",  
                                  "Q04_Outgoing",  
                                  "Q05_Outgoing")])
# print variance accounted for
summary(pca_res2) 
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4      Comp.5
## Standard deviation     3.2317383 0.62656643 0.50633614 0.44585671 0.332592305
## Proportion of Variance 0.9159511 0.03442977 0.02248422 0.01743374 0.009701174
## Cumulative Proportion  0.9159511 0.95038087 0.97286509 0.99029883 1.000000000

The cumulative proportion of variance already shows that the five items represent a single underlying factor, as 91.6% of the variance is explained by PC1 alone!

The loading scores for PC1 are very similar for all items and thus show that none of the items underperforms. In such a case, it might make sense to remove items, not because they do not capture the underlying concept but to reduce the number of items participants have to answer.
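The loadings themselves can be inspected as shown below (a sketch; we do not print the output here):

# inspect the loadings of the five items on the components
loadings(pca_res2)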

# prepare data
pca2.var <- pca_res2$sdev^2
# extract percentage of contribution
pca2.var.per <- round(pca2.var/sum(pca2.var)*100, 1)
# generate data frame for visualization
data.frame(Component = paste0("PC", 1:5),
                          Percent = round(pca2.var/sum(pca2.var)*100, 2)) %>%
  dplyr::mutate(Text = paste0(Percent, "%"),
                Component = factor(Component, 
                                    levels = c("PC1", "PC2", "PC3", "PC4", "PC5"))) %>%
  ggplot(aes(x = Component, y = Percent, label = Text)) +
  geom_bar(stat = "identity") +
  geom_text(vjust=-1.6, size = 3) +
  ggtitle("Scree Plot") +
  labs(x="Principal Components", y="Percent Variation") +
  theme_bw() +
  coord_cartesian(ylim = c(0, 100))

We now plot the loading scores for each participant to see if we can see groups among the participants based on outgoingness.

pca_res2$scores %>%
  as.data.frame() %>%
  dplyr::rename(PC1 = 1,
                PC2 = 2,
                PC3 = 3,
                PC4 = 4,
                PC5 = 5) %>%
  dplyr::mutate(Participants = paste0("P", 1:20),
                clr = ifelse(PC1 > 0, 1, 0)) %>%
  ggplot(aes(x = PC1, y = PC2, color = clr, label = Participants)) +
  geom_point(size=2) +
  geom_text(size = 3, nudge_x = -.2, check_overlap = T) +
  theme_bw() +
  scale_color_viridis_b() +
  theme(legend.position = "none")

The figure shows that, based on PC1, we can distinguish between subjects who are very outgoing and subjects who are not very outgoing.

Example 3

In this example, we use the biopsy data (breast cancer biopsies from the MASS package) to see which variables can be used to predict whether a tumor is malignant.

In a first step, we remove any data points with missing values and add meaningful variable names.

data(biopsy)
biopsy_nona <- biopsy %>%
  tidyr::drop_na() %>%
  # add meaningful variable names
  dplyr::rename(ClumpSize = V1,
                CellSize = V2,
                CellShape = V3,
                Adhesion = V4,
                EpithelialSize = V5,
                Nuclei = V6,
                Chromatin = V7,
                Nucleoli = V8,
                Mitoses = V9)
# inspect data
flextable::flextable(head(biopsy_nona, 10))

     ID  ClumpSize  CellSize  CellShape  Adhesion  EpithelialSize  Nuclei  Chromatin  Nucleoli  Mitoses      class
1000025          5         1          1         1               2       1          3         1        1     benign
1002945          5         4          4         5               7      10          3         2        1     benign
1015425          3         1          1         1               2       2          3         1        1     benign
1016277          6         8          8         1               3       4          3         7        1     benign
1017023          4         1          1         3               2       1          3         1        1     benign
1017122          8        10         10         8               7      10          9         7        1  malignant
1018099          1         1          1         1               2      10          3         1        1     benign
1018561          2         1          2         1               2       1          3         1        1     benign
1033078          2         1          1         1               2       1          1         1        5     benign
1033078          4         2          1         1               2       1          2         1        1     benign

Next, we remove non-numeric variables.

biopsy_num <- biopsy_nona %>%
  dplyr::select_if(is.numeric)
# inspect data
flextable::flextable(head(biopsy_num, 10))

ClumpSize  CellSize  CellShape  Adhesion  EpithelialSize  Nuclei  Chromatin  Nucleoli  Mitoses
        5         1          1         1               2       1          3         1        1
        5         4          4         5               7      10          3         2        1
        3         1          1         1               2       2          3         1        1
        6         8          8         1               3       4          3         7        1
        4         1          1         3               2       1          3         1        1
        8        10         10         8               7      10          9         7        1
        1         1          1         1               2      10          3         1        1
        2         1          2         1               2       1          3         1        1
        2         1          1         1               2       1          1         1        5
        4         2          1         1               2       1          2         1        1

We can perform the PCA.

biopsy_pca <- prcomp(biopsy_num,
                     scale = T)
# summary
summary(biopsy_pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
## Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
## Cumulative Proportion  0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
##                            PC8     PC9
## Standard deviation     0.51062 0.29729
## Proportion of Variance 0.02897 0.00982
## Cumulative Proportion  0.99018 1.00000

First, we create a scree plot of the variance captured by each component.

factoextra::fviz_eig(biopsy_pca,
                     addlabels = T,
                     ylim = c(0, 80),
                     barfill = "gray50",
                     barcolor = "gray20")

We now create a pretty biplot to show the results of the PCA.

fviz_pca_biplot(biopsy_pca,
                geom = "point",
                # no data point labels
                label = "var",
                # color by class
                habillage = biopsy_nona$class,
                # change variable color 
                col.var = "black") + 
  # adapt data point color
  ggplot2::scale_color_manual(values = c("orange", "purple")) +
  # add title
  ggtitle("A pretty biplot")

The plot above is also known as a variable correlation plot or biplot. A biplot overlays a score plot with a loading plot and thus shows individual data points as well as the relationships between all variables (in the form of arrows). It can be interpreted as follows:

  • Positively correlated variables are grouped together.
  • Negatively correlated variables would be positioned on opposite sides of the plot origin (opposed quadrants).
  • The distance between variables and the origin measures the quality of the variables on the factor map.
  • Variables that are away from the origin are well represented on the factor map.

But why do all variables cluster together?

Let’s have a look at the correlations between variables.
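The correlation matrix shown below can be generated with the cor() function (a minimal sketch using the numeric biopsy data):

# correlations between the numeric biopsy variables
cor(biopsy_num)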

                ClumpSize   CellSize  CellShape   Adhesion  EpithelialSize     Nuclei  Chromatin   Nucleoli    Mitoses
ClumpSize       1.0000000  0.6424815  0.6534700  0.4878287       0.5235960  0.5930914  0.5537424  0.5340659  0.3509572
CellSize        0.6424815  1.0000000  0.9072282  0.7069770       0.7535440  0.6917088  0.7555592  0.7193460  0.4607547
CellShape       0.6534700  0.9072282  1.0000000  0.6859481       0.7224624  0.7138775  0.7353435  0.7179634  0.4412576
Adhesion        0.4878287  0.7069770  0.6859481  1.0000000       0.5945478  0.6706483  0.6685671  0.6031211  0.4188983
EpithelialSize  0.5235960  0.7535440  0.7224624  0.5945478       1.0000000  0.5857161  0.6181279  0.6289264  0.4805833
Nuclei          0.5930914  0.6917088  0.7138775  0.6706483       0.5857161  1.0000000  0.6806149  0.5842802  0.3392104
Chromatin       0.5537424  0.7555592  0.7353435  0.6685671       0.6181279  0.6806149  1.0000000  0.6656015  0.3460109
Nucleoli        0.5340659  0.7193460  0.7179634  0.6031211       0.6289264  0.5842802  0.6656015  1.0000000  0.4337573
Mitoses         0.3509572  0.4607547  0.4412576  0.4188983       0.4805833  0.3392104  0.3460109  0.4337573  1.0000000

The correlation matrix shows that all variables are positively correlated, which explains why all variables cluster together. The comparatively low correlations between Mitoses and the other variables explain why it stands out.

Multidimensional Scaling

Multidimensional Scaling (MDS, see Davison (1983) or Jacoby and Ciuk (2018)) is used for visualizing high-dimensional data in a lower-dimensional space while preserving pairwise distances or similarities between data points. MDS aims to represent the data in a way that maintains the relationships among data points as much as possible.

There are two types of MDS:

  • metric or classical MDS: metric MDS aims to represent the distances between objects in a way that preserves the original dissimilarity/similarity relationships as accurately as possible. It assumes that the input data reflects actual distances or dissimilarities and it uses algorithms such as Principal Coordinate Analysis (PCoA) to transform the data into a lower-dimensional space while trying to maintain the original pairwise distances.

  • non-metric MDS: non-metric MDS aims to represent the rank order of the distances or dissimilarities between objects, rather than the actual distances. It is more suitable when the input data does not have a direct metric interpretation or when the distances are only known in terms of ordinal rankings.

MDS works exactly like PCA with one important difference: while PCA starts off with the correlations between variables (the covariance matrix), MDS starts off with distances (and thus relies on a distance matrix)!
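To illustrate the difference between the two MDS types, here is a minimal sketch that contrasts metric MDS (cmdscale from the stats package) with non-metric MDS (isoMDS from the MASS package), using the built-in eurodist distance matrix (road distances between European cities) purely for illustration:

# metric (classical) MDS: tries to preserve the distances themselves
mds_metric <- cmdscale(eurodist, k = 2)
# non-metric MDS: only tries to preserve the rank order of the distances
mds_nonmetric <- MASS::isoMDS(eurodist, k = 2)
# inspect the first few 2-dimensional coordinates of each solution
head(mds_metric)
head(mds_nonmetric$points)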

Example 1

We begin by loading data. This fictitious data set represents responses to 15 items by 20 participants. The 15 items aim to assess 3 different psychological constructs:

  • outgoingness (extroversion)

  • intelligence

  • attitude

surveydata <- base::readRDS(url("https://slcladal.github.io/data/sud.rda", "rb"))
# inspect
report(surveydata)
## The data contains 20 observations of the following 16 variables:
## 
##   - Respondent: 20 entries, such as Respondent_01 (n = 1); Respondent_02 (n = 1);
## Respondent_03 (n = 1) and 17 others (0 missing)
##   - Q01_Outgoing: n = 20, Mean = 3.15, SD = 1.53, Median = 3.50, MAD = 2.22,
## range: [1, 5], Skewness = -0.18, Kurtosis = -1.52, 0 missing
##   - Q02_Outgoing: n = 20, Mean = 3.25, SD = 1.59, Median = 3.50, MAD = 2.22,
## range: [1, 5], Skewness = -0.10, Kurtosis = -1.73, 0 missing
##   - Q03_Outgoing: n = 20, Mean = 3.15, SD = 1.50, Median = 3.50, MAD = 2.22,
## range: [1, 5], Skewness = -0.07, Kurtosis = -1.58, 0 missing
##   - Q04_Outgoing: n = 20, Mean = 3.00, SD = 1.56, Median = 3.50, MAD = 2.22,
## range: [1, 5], Skewness = -0.09, Kurtosis = -1.64, 0 missing
##   - Q05_Outgoing: n = 20, Mean = 3.20, SD = 1.58, Median = 3.50, MAD = 2.22,
## range: [1, 5], Skewness = -0.19, Kurtosis = -1.59, 0 missing
##   - Q06_Intelligence: n = 20, Mean = 2.90, SD = 1.45, Median = 2.00, MAD = 1.48,
## range: [1, 5], Skewness = 0.31, Kurtosis = -1.43, 0 missing
##   - Q07_Intelligence: n = 20, Mean = 3.05, SD = 1.39, Median = 3.00, MAD = 1.48,
## range: [1, 5], Skewness = 0.03, Kurtosis = -1.24, 0 missing
##   - Q08_Intelligence: n = 20, Mean = 2.95, SD = 1.54, Median = 2.50, MAD = 2.22,
## range: [1, 5], Skewness = 0.19, Kurtosis = -1.54, 0 missing
##   - Q09_Intelligence: n = 20, Mean = 2.80, SD = 1.51, Median = 2.00, MAD = 1.48,
## range: [1, 5], Skewness = 0.38, Kurtosis = -1.48, 0 missing
##   - Q10_Intelligence: n = 20, Mean = 3.60, SD = 1.50, Median = 4.00, MAD = 1.48,
## range: [1, 5], Skewness = -0.58, Kurtosis = -1.33, 0 missing
##   - Q11_Attitude: n = 20, Mean = 2.75, SD = 1.62, Median = 2.50, MAD = 2.22,
## range: [1, 5], Skewness = 0.20, Kurtosis = -1.65, 0 missing
##   - Q12_Attitude: n = 20, Mean = 2.95, SD = 1.43, Median = 2.50, MAD = 2.22,
## range: [1, 5], Skewness = 0.22, Kurtosis = -1.40, 0 missing
##   - Q13_Attitude: n = 20, Mean = 2.75, SD = 1.65, Median = 2.00, MAD = 1.48,
## range: [1, 5], Skewness = 0.21, Kurtosis = -1.77, 0 missing
##   - Q14_Attitude: n = 20, Mean = 2.95, SD = 1.54, Median = 2.50, MAD = 2.22,
## range: [1, 5], Skewness = 0.19, Kurtosis = -1.54, 0 missing
##   - Q15_Attitude: n = 20, Mean = 2.95, SD = 1.70, Median = 2.50, MAD = 2.22,
## range: [1, 5], Skewness = 0.09, Kurtosis = -1.82, 0 missing

As MDS, like PCA and FA, works only on numeric data, we remove non-numeric variables from the data.

surveydata <- surveydata %>% 
  dplyr::select(-Respondent)
# inspect
str(surveydata)
## 'data.frame':    20 obs. of  15 variables:
##  $ Q01_Outgoing    : int  4 5 5 5 4 5 4 4 5 4 ...
##  $ Q02_Outgoing    : int  5 4 4 5 5 5 5 4 5 5 ...
##  $ Q03_Outgoing    : int  4 5 4 5 4 5 4 5 4 5 ...
##  $ Q04_Outgoing    : int  4 4 5 4 5 5 5 4 4 4 ...
##  $ Q05_Outgoing    : int  5 4 5 5 5 4 5 5 4 4 ...
##  $ Q06_Intelligence: int  2 2 2 1 2 5 4 5 5 4 ...
##  $ Q07_Intelligence: int  3 2 1 1 2 4 5 4 5 4 ...
##  $ Q08_Intelligence: int  3 2 1 1 1 5 4 5 4 5 ...
##  $ Q09_Intelligence: int  2 1 2 1 2 2 4 4 4 4 ...
##  $ Q10_Intelligence: int  2 2 2 1 1 2 5 5 5 5 ...
##  $ Q11_Attitude    : int  3 4 5 5 4 1 2 1 1 2 ...
##  $ Q12_Attitude    : int  3 4 4 4 5 2 1 2 2 1 ...
##  $ Q13_Attitude    : int  2 4 4 5 4 1 1 1 2 2 ...
##  $ Q14_Attitude    : int  3 5 4 5 5 2 2 1 1 2 ...
##  $ Q15_Attitude    : int  3 4 4 5 5 1 1 2 2 1 ...

For MDS, we first calculate the distance matrix using the Euclidean distance. Note that we are transposing, scaling, and centering the data just like for PCA.

survey_dist <- dist(scale(t(surveydata), center=TRUE, scale=TRUE),
  method="euclidean")

Now, we generate an MDS object (this is basically eigenvalue decomposition)

mds.obj <- cmdscale(survey_dist, 
                    eig=TRUE,   # adds a matrix of eigenvalues to the mds object
                    x.ret=TRUE) # adds a symmetric distance matrix to the mds object

Now, we calculate the percentage of variation that each MDS axis accounts for…

mds.var.per <- round(mds.obj$eig/sum(mds.obj$eig)*100, 1)
# inspect
mds.var.per
##  [1] 54.8 29.0  4.8  2.7  2.1  1.7  1.5  1.0  0.9  0.6  0.4  0.3  0.1  0.0  0.0

We make a fancy looking plot that shows the MDS axes and the variation:

mds.values <- mds.obj$points
mds.data <- data.frame(Sample=rownames(mds.values),
  X=mds.values[,1],
  Y=mds.values[,2])
# inspect
mds.data
##                            Sample          X          Y
## Q01_Outgoing         Q01_Outgoing  1.2441258 -2.9521948
## Q02_Outgoing         Q02_Outgoing  1.3832771 -3.2215742
## Q03_Outgoing         Q03_Outgoing  0.8464687 -2.8364303
## Q04_Outgoing         Q04_Outgoing  0.6884322 -3.1352690
## Q05_Outgoing         Q05_Outgoing  0.3742058 -3.2123840
## Q06_Intelligence Q06_Intelligence  3.7001253  1.1236968
## Q07_Intelligence Q07_Intelligence  3.4292730  1.8623706
## Q08_Intelligence Q08_Intelligence  4.0132660  1.5106590
## Q09_Intelligence Q09_Intelligence  3.8680562  2.5632774
## Q10_Intelligence Q10_Intelligence  1.5665476  3.9359300
## Q11_Attitude         Q11_Attitude -4.4171807  0.6011738
## Q12_Attitude         Q12_Attitude -3.8486419  1.2134235
## Q13_Attitude         Q13_Attitude -4.6650593  1.4927673
## Q14_Attitude         Q14_Attitude -3.8067797  0.1881540
## Q15_Attitude         Q15_Attitude -4.3761161  0.8663999
ggplot(data=mds.data, aes(x=X, y=Y, label=Sample)) +
  geom_text() +
  theme_bw() +
  xlab(paste("MDS1 - ", mds.var.per[1], "%", sep="")) +
  ylab(paste("MDS2 - ", mds.var.per[2], "%", sep="")) +
  ggtitle("MDS plot using Euclidean distance") +
  coord_cartesian(xlim = c(-7, 7), ylim = c(-4, 4))

The plot shows that we are dealing with 3 groups that are neatly separated along 2 dimensions. We also see that Q10 behaves differently from the other items aiming to assess intelligence. We could use this to optimize the item set of our survey.

Factor Analysis

Factor Analysis is used to identify underlying latent factors that explain the observed correlations among variables in a data set (Kim and Mueller 1978; Field, Miles, and Field 2012, 749–811). It aims to reduce the complexity of high-dimensional data by identifying a smaller number of factors (latent variables) that contribute to the variance and covariance among the observed variables. Ideally, observed variables that represent an underlying factor should be highly correlated with each other but not correlated with observed variables that represent other factors. Like PCA and MDS, we find factors by aiming to account for the variability in the data, but unlike PCA, which is based on a covariance matrix, or MDS, which is based on a distance matrix, EFA is based on a correlation matrix.

There are two types of factor analysis:

  • exploratory factor analysis (EFA)

  • confirmatory factor analysis (CFA)

Here, we are only going to look at EFA, which is used to identify how many latent variables (factors) are present in a data set. In contrast to CFA, EFA allows factors to be correlated with each other, and it uses a rotation step to transform the initial factor structure into a more interpretable form. The rotation aims to maximize the loading of each variable on a single factor and thereby maximizes the separation between factors. Common rotation methods include Varimax, Oblimin, and Promax. The Varimax rotation is an orthogonal rotation and the most common rotation method; it assumes that the factors are not correlated. Oblique rotations, such as Oblimin and Promax, in contrast, allow the factors to be correlated (Field, Miles, and Field 2012, 755).
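With the factanal() function used further below, the rotation method can be set via the rotation argument (a minimal sketch; varimax is the default and promax is an oblique alternative):

# orthogonal (varimax) rotation - the default in factanal()
fa_varimax <- factanal(surveydata, factors = 3, rotation = "varimax")
# oblique (promax) rotation - allows the factors to correlate
fa_promax <- factanal(surveydata, factors = 3, rotation = "promax")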

CFA is used for hypothesis testing to confirm structures assumed to be present in the data. In R, we can use the sem or lavaan packages, which implement Structural Equation Models (SEM), to perform CFA and test specific hypotheses about the existence and interactions of latent factors.
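As a rough illustration (not run here), a CFA of the survey data could look like the sketch below; it assumes that the lavaan package is installed and that we label the three hypothesised factors outgoing, intelligence, and attitude:

# install.packages("lavaan") # only needed once
library(lavaan)
# specify the hypothesised measurement model (factor names are chosen by us)
cfa_model <- '
  outgoing     =~ Q01_Outgoing + Q02_Outgoing + Q03_Outgoing + Q04_Outgoing + Q05_Outgoing
  intelligence =~ Q06_Intelligence + Q07_Intelligence + Q08_Intelligence + Q09_Intelligence + Q10_Intelligence
  attitude     =~ Q11_Attitude + Q12_Attitude + Q13_Attitude + Q14_Attitude + Q15_Attitude
'
# fit the model and inspect fit measures and standardized loadings
cfa_fit <- cfa(cfa_model, data = surveydata)
summary(cfa_fit, fit.measures = TRUE, standardized = TRUE)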

Example 1

We continue with the survey data set. We can now implement the EFA using the factanal function from the base stats package. Although factanal implements an EFA, we need to provide it with the number of factors (latent variables) it should look for (commonly, you need to vary this to find the optimal number of factors).

As a first step, we generate a scree plot to assess how many factors are in our data. There are different approaches to interpreting the output: (a) we choose the number of factors at which there is a noticeable bend (the so-called point of inflexion; Field, Miles, and Field 2012, 763) in the eigenvalues of the factors, or (b) if there is no noticeable bend, we select the last value above 1 (following Kaiser (1960); however, Jolliffe (1987) suggests using 0.7, as 1 is too strict) (see Field, Miles, and Field 2012, 763 for more details on this discussion). Alternatively, we perform the analysis with different numbers of factors and then check if the SS loadings (the sum-of-squared loadings) of each factor have a value greater than 1.

# extract eigenvalues
eigenvalues <- eigen(cor(surveydata))
# inspect
#eigenvalues

# create scree plot
eigenvalues$values %>%
  as.data.frame() %>%
  dplyr::rename(Eigenvalues = 1) %>%
  dplyr::mutate(Factors = 1:nrow(.)) %>%
  ggplot(aes(x = Factors, y = Eigenvalues, label = round(Eigenvalues, 2))) +
  geom_bar(stat = "identity", fill = "lightgray") +
  geom_line(color = "red", size = 1) +
  geom_point() +
  geom_text(vjust = -1.2) +
  geom_hline(yintercept = 1, color = "blue", linetype = "dotted") +
  theme_bw() +
  coord_cartesian(ylim = c(0, 10)) +
  ggtitle("Scree plot")

As there is a noticeable bend in the eigenvalues at 3 factors, we are justified in assuming that there are 3 factors (latent or hidden variables) in our data. We can now perform the factor analysis and set the number of expected factors to 3.

factoranalysis <- factanal(surveydata, # data
                           3) # number of factors
print(factoranalysis, 
      digits = 2,  # round to x decimals
      cutoff = .2, # do not show loadings smaller than .2
      sort = TRUE) # show items sorted
## 
## Call:
## factanal(x = surveydata, factors = 3)
## 
## Uniquenesses:
##     Q01_Outgoing     Q02_Outgoing     Q03_Outgoing     Q04_Outgoing 
##             0.09             0.06             0.12             0.07 
##     Q05_Outgoing Q06_Intelligence Q07_Intelligence Q08_Intelligence 
##             0.14             0.10             0.13             0.10 
## Q09_Intelligence Q10_Intelligence     Q11_Attitude     Q12_Attitude 
##             0.28             0.41             0.08             0.14 
##     Q13_Attitude     Q14_Attitude     Q15_Attitude 
##             0.04             0.09             0.06 
## 
## Loadings:
##                  Factor1 Factor2 Factor3
## Q06_Intelligence -0.82    0.25    0.41  
## Q07_Intelligence -0.80            0.47  
## Q08_Intelligence -0.85            0.42  
## Q09_Intelligence -0.79            0.29  
## Q11_Attitude      0.96                  
## Q12_Attitude      0.92                  
## Q13_Attitude      0.97                  
## Q14_Attitude      0.95                  
## Q15_Attitude      0.96                  
## Q01_Outgoing              0.94          
## Q02_Outgoing              0.96          
## Q03_Outgoing              0.93          
## Q04_Outgoing              0.96          
## Q05_Outgoing              0.92          
## Q10_Intelligence -0.22   -0.46    0.57  
## 
##                Factor1 Factor2 Factor3
## SS loadings       7.29    4.78    1.02
## Proportion Var    0.49    0.32    0.07
## Cumulative Var    0.49    0.80    0.87
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 62.79 on 63 degrees of freedom.
## The p-value is 0.484

Let us examine the output.

We start with the function call, followed by uniqueness scores. Uniqueness refers to the portion of the variance in an observed variable that is not explained by the underlying factors extracted from the data. It represents the variability in a variable that is unique to that variable and cannot be accounted for by the common factors identified in the analysis (uniqueness is the variance that is unique to that variable whereas commonness is the variance it shares with other variables or factors) (see Field, Miles, and Field 2012, 759). This means that items that do not capture or represent a latent variable well, will have high uniqueness scores (see, e.g., Q10_Intelligence).

Next, the output shows a table with factor loadings. Loadings are correlations between an observed variable and a latent factor. Loadings indicate how strongly each observed variable is associated with each extracted factor. They provide insights into which factors influence the variation in each observed variable. Factor loadings can be categorized as low (< 0.4), acceptable (0.4–0.6), or high (> 0.6) (Stevens 1992; Tabachnick, Fidell, and Ullman 2007; Kline 2014). The fact that Q10_Intelligence loads across factors shows that it does not represent its intended factor (intelligence) very well.

We then get a table showing the sum-of-squares loadings with the amount of variance explained by each latent variable which can help identify the optimal or necessary number of latent variables (factors) that we should or need to look for. As stated above, all factors should have a value greater than 1 - if there is a value below 1 for the SS loadings, then this could mean that at least one factor could be removed.

The p-value should not be significant (if it is significant, then you have missed factors). In our case, 2 factors would also yield a non-significant result due to the low number of data points (but p is highest for 3 factors). If this test is significant, then this suggests that you may need more factors than you currently have in your FA (but be careful: more factors can also lead to situations where you get factors that are more difficult to interpret, so make sure that the factors you include make sense).

We can now plot the results to see if and how the different items group together. We start by creating a data frame from the FA results containing the information we need for the visualization.

fadat <- data.frame(Factor1 = factoranalysis$loadings[,1],
                    Factor2 = factoranalysis$loadings[,2],
                    Items = names(surveydata))
# inspect
head(fadat)
##                      Factor1   Factor2            Items
## Q01_Outgoing     -0.13793094 0.9411952     Q01_Outgoing
## Q02_Outgoing     -0.15957196 0.9560680     Q02_Outgoing
## Q03_Outgoing     -0.06826263 0.9334254     Q03_Outgoing
## Q04_Outgoing     -0.05828165 0.9619694     Q04_Outgoing
## Q05_Outgoing      0.07146568 0.9228791     Q05_Outgoing
## Q06_Intelligence -0.82137535 0.2535663 Q06_Intelligence

Now that the data is formatted appropriately, we can plot it.

fadat %>%
  ggplot(aes(x = Factor1, y = Factor2, label = Items, color = Factor1)) +
  geom_text() + 
  theme_bw() +
  coord_cartesian(xlim = c(-1.5, 1.5), ylim = c(-1, 1.5)) +
  scale_color_viridis_b() +
  labs(x = "Factor 1 (49 %)", y = "Factor 2 (32 %)")

As we can see, there are 3 groups (factors) with Q10_Intelligence not aligning well with the other elements in that group. This could suggest that Q10_Intelligence is not an optimal item / variable for reflecting (or capturing) the latent intelligence factor.

That’s all folks!

Citation & Session Info

Schweinberger, Martin. 2023. Introduction to Dimension Reduction Methods with R. The University of Queensland, Australia. School of Languages and Cultures. url: www.ladal.edu.au/dimred.html (Version 2023.09.09).

@manual{schweinberger2023dimred,
  author = {Schweinberger, Martin},
  title = {Introduction to Dimension Reduction Methods with R},
  note = {www.ladal.edu.au/dimred.html},
    year = {2023},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2023.09.09}
}
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tufte_0.13       psych_2.3.6      report_0.5.7     factoextra_1.0.7
##  [5] ggplot2_3.4.2    MASS_7.3-60      tidyr_1.3.0      stringr_1.5.0   
##  [9] flextable_0.9.1  here_1.0.1       dplyr_1.1.2     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-163            fontquiver_0.2.1        insight_0.19.2         
##  [4] rprojroot_2.0.3         tools_4.2.2             backports_1.4.1        
##  [7] bslib_0.4.2             utf8_1.2.3              R6_2.5.1               
## [10] colorspace_2.1-0        withr_2.5.0             tidyselect_1.2.0       
## [13] mnormt_2.1.1            curl_5.0.0              compiler_4.2.2         
## [16] textshaping_0.3.6       cli_3.6.1               xml2_1.3.4             
## [19] officer_0.6.2           fontBitstreamVera_0.1.1 labeling_0.4.2         
## [22] sass_0.4.6              scales_1.2.1            askpass_1.1            
## [25] systemfonts_1.0.4       digest_0.6.31           rmarkdown_2.21         
## [28] gfonts_0.2.0            pkgconfig_2.0.3         htmltools_0.5.5        
## [31] fastmap_1.1.1           highr_0.10              rlang_1.1.1            
## [34] rstudioapi_0.14         httpcode_0.3.0          shiny_1.7.4            
## [37] jquerylib_0.1.4         farver_2.1.1            generics_0.1.3         
## [40] jsonlite_1.8.4          zip_2.3.0               car_3.1-2              
## [43] magrittr_2.0.3          Rcpp_1.0.10             munsell_0.5.0          
## [46] fansi_1.0.4             abind_1.4-5             gdtools_0.3.3          
## [49] lifecycle_1.0.3         stringi_1.7.12          yaml_2.3.7             
## [52] carData_3.0-5           grid_4.2.2              parallel_4.2.2         
## [55] promises_1.2.0.1        ggrepel_0.9.3           crayon_1.5.2           
## [58] lattice_0.21-8          knitr_1.43              pillar_1.9.0           
## [61] ggpubr_0.6.0            uuid_1.1-0              ggsignif_0.6.4         
## [64] crul_1.4.0              glue_1.6.2              evaluate_0.21          
## [67] fontLiberation_0.1.0    data.table_1.14.8       vctrs_0.6.2            
## [70] httpuv_1.6.11           gtable_0.3.3            openssl_2.0.6          
## [73] purrr_1.0.1             datawizard_0.7.1        cachem_1.0.8           
## [76] xfun_0.39               mime_0.12               xtable_1.8-4           
## [79] broom_1.0.4             rstatix_0.7.2           later_1.3.1            
## [82] ragg_1.2.5              viridisLite_0.4.2       tibble_3.2.1           
## [85] ellipsis_0.3.2



References

Amoeba. 2022. “Making Sense of Principal Component Analysis, Eigenvectors &Amp; Eigenvalues.” Cross Validated. https://stats.stackexchange.com/q/140579.
Clark, Neil R, and Avi Ma’ayan. 2011. “Introduction to Statistical Methods to Analyze Large Data Sets: Principal Components Analysis.” Science Signaling 4 (190): tr3–3.
Davison, Mark L. 1983. “Introduction to Multidimensional Scaling and Its Applications.” Applied Psychological Measurement. Sage Publications Sage CA: Thousand Oaks, CA.
Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using r. Sage.
Jacoby, William G, and David J Ciuk. 2018. “Multidimensional Scaling: An Introduction.” The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test Development, 375–412.
Jolliffe, Ian T. 1987. “Rotation of Principal Components: Some Comments.” Journal of Climatology 7 (5): 507–10.
Kaiser, Henry F. 1960. “The Application of Electronic Computers to Factor Analysis.” Educational and Psychological Measurement 20 (1): 141–51.
Kim, Jae-On, and Charles W Mueller. 1978. Introduction to Factor Analysis: What It Is and How to Do It. 13. Sage.
Kline, Paul. 2014. An Easy Guide to Factor Analysis. Routledge.
Starmer, Josh. 2018. “StatQuest: Principal Component Analysis (PCA), Step-by-Step.” YouTube. https://www.youtube.com/watch?v=FgakZw6K1QQ.
Stevens, James. 1992. Applied Statistics for the Social Sciences. Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Tabachnick, Barbara G, Linda S Fidell, and Jodie B Ullman. 2007. Using Multivariate Statistics. 5th ed. Boston: Pearson.

  1. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.↩︎