Dimension Reduction Methods with R

Author

Martin Schweinberger

Introduction

This tutorial introduces dimension reduction methods with R. The aim is to show how these methods work, how to prepare data for them, and how to implement three of the most widely used techniques in linguistics and the social sciences: Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and Exploratory Factor Analysis (EFA). Reducing many variables to a smaller number of meaningful dimensions is useful for visualisation, noise reduction, identifying latent structure, and simplifying complex data sets for further analysis.

This tutorial is aimed at beginners and intermediate users of R. The goal is not to provide an exhaustive statistical treatment but to give a solid conceptual grounding and practical hands-on experience with each method.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Explain what dimension reduction is and why it is useful in linguistic and social science research
  2. Distinguish between PCA, MDS, and EFA and understand when each method is appropriate
  3. Understand the key concepts underlying PCA: eigenvectors, eigenvalues, principal components, loading scores, and explained variance
  4. Implement PCA in R using prcomp() and princomp(), interpret scree plots and biplots, and extract loading scores
  5. Implement MDS in R using cmdscale(), compute a distance matrix, and interpret an MDS plot
  6. Implement EFA in R using factanal(), determine the number of factors using scree plots and the Kaiser criterion, and interpret factor loadings and uniqueness scores
  7. Choose the most appropriate dimension reduction method for a given research question and data structure
Citation

Schweinberger, Martin. 2026. Dimension Reduction Methods with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/dimensionredux_tutorial/dimensionredux_tutorial.html (Version 2026.05.01).


Preparation and session set up

This tutorial requires several R packages. Run the installation code below if you have not yet installed them (you only need to do this once).

Code
install.packages("dplyr")
install.packages("stringr")
install.packages("flextable")
install.packages("tidyr")
install.packages("MASS")
install.packages("factoextra")
install.packages("ggplot2")
install.packages("report")
install.packages("psych")
install.packages("checkdown")

Once installed, load the packages for this session:

Code
options(scipen = 999)  # suppress scientific notation
library(dplyr)
library(flextable)
library(stringr)
library(tidyr)
library(MASS)
library(factoextra)
library(ggplot2)
library(report)
library(psych)

What Are Dimension Reduction Methods?

Section Overview

What you will learn: What dimension reduction is; why it is useful; the three main methods covered in this tutorial (PCA, MDS, and FA); and the key differences between them

Dimension reduction methods such as Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and Factor Analysis (FA) are techniques used to reduce the number of variables or dimensions in a data set while preserving as much relevant information as possible (Jolliffe 2002; Field, Miles, and Field 2012).

The need for dimension reduction arises because:

  • High-dimensional data is difficult to visualise directly — humans can intuitively perceive two or three dimensions, not fifty
  • Many variables in a real data set are correlated and therefore contain redundant information
  • Reducing dimensions removes noise and makes patterns more visible
  • Downstream statistical models often perform better on a smaller set of meaningful dimensions
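The redundancy point can be made concrete with a minimal sketch (toy data, not part of the tutorial's data sets): when two variables are strongly correlated, a single component summarises them almost losslessly.

```r
# Toy illustration: y is largely redundant given x, so one principal
# component captures almost all of the variance in the pair.
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.2)
toy_pca <- prcomp(cbind(x, y), scale. = TRUE)
summary(toy_pca)$importance["Proportion of Variance", ]
```

PC1 should account for the overwhelming share of the variance here, which is exactly the situation dimension reduction exploits.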

Below are brief explanations of the three methods covered in this tutorial.

Principal Component Analysis (PCA)

PCA transforms a data set into a new coordinate system where the new variables (principal components) are linearly uncorrelated. It finds a set of orthogonal axes such that the first principal component captures the maximum variance in the data, the second captures the second maximum variance, and so on. By selecting only the top components, you achieve dimension reduction while retaining the most important information.

PCA is driven by the covariance matrix of the data. It is most effective when the dominant structure in the data can be captured by a small number of linear combinations of the original variables.

Multidimensional Scaling (MDS)

MDS represents high-dimensional data in a lower-dimensional space — usually 2D or 3D — while preserving pairwise distances or similarities between data points. Rather than starting from correlations between variables (as PCA does), MDS starts from a distance matrix between observations.

MDS is particularly useful for visualising the similarity structure of a set of items — for example, how similar a set of languages are to each other based on a profile of typological features, or how different survey respondents are from each other.

Factor Analysis (FA)

Factor Analysis aims to uncover underlying latent factors that explain the observed correlations among variables. It assumes that the observed variables are imperfect reflections of a smaller number of hidden (latent) constructs — for example, that responses to twenty survey items about language attitudes actually reflect three underlying attitude dimensions.

FA is driven by the correlation matrix of the variables. Unlike PCA (which is a mathematical transformation with no model assumptions), FA explicitly models each observed variable as a combination of common factors and a variable-specific unique component.

Key comparison

                 PCA                            MDS                             Factor Analysis
Starting point   Covariance matrix              Distance matrix                 Correlation matrix
Goal             Maximise captured variance     Preserve pairwise distances     Identify latent factors
Assumptions      None (pure transformation)     None (pure transformation)      Explicit factor model
Output           Principal components           Coordinates in reduced space    Factor loadings + uniqueness
Typical use      Data reduction, visualisation  Similarity mapping              Construct validation, surveys

Exercises: Overview

Q1. A researcher has collected measurements of 30 grammatical features across 50 languages and wants to visualise which languages are most similar to each other. Which dimension reduction method is most appropriate and why?






Principal Component Analysis

Section Overview

What you will learn: The conceptual foundations of PCA — centering and scaling, eigenvectors and eigenvalues, variance explained, loading scores; how PCA differs from ordinary least squares regression; and three worked examples in R

Principal Component Analysis (PCA) is used for dimensionality reduction and data compression while preserving as much variance as possible (Jolliffe 2002). It achieves this by transforming the original data into a new coordinate system defined by a set of orthogonal axes called principal components. The first principal component captures the maximum variance in the data, the second captures the second maximum variance, and so on.

A conceptual introduction to PCA

To make PCA tangible, consider a classic analogy. Imagine you have just opened a cider shop with 50 varieties of cider. You want to arrange them on shelves so that similar-tasting ciders are grouped together. There are many taste dimensions: sweetness, tartness, bitterness, yeastiness, fruitiness, fizziness, and so on. PCA helps you answer:

  1. Which combination of taste qualities is most useful for distinguishing groups of ciders?
  2. Can some taste variables be combined — because they tend to go together — into a single summary dimension?

This is essentially what PCA does: it finds new axes (principal components) that are combinations of the original variables and that explain the maximum amount of variation in the data. The first component is the single axis along which the data varies most; the second (orthogonal to the first) is the next most variable direction; and so on.

PCA can tell us whether there are groups in our data, how many there are, and which features are responsible for the division into groups.

Step-by-step: how PCA works

Let us work through a minimal example with six language observations to build intuition. The table below lists four features; the walkthrough focuses on the first two, Feat1 and Feat2.

Language  Feat1  Feat2  Feat3  Feat4
Lang1        10    6.0   12.0      5
Lang2        11    4.0    9.0      7
Lang3         8    5.0   10.0      6
Lang4         3    3.0    2.5      2
Lang5         1    2.8    1.3      4
Lang6         2    1.0    2.0      7

When we plot Feat1 alone, we can already see two clusters:

Adding Feat2 on the y-axis:

Two groups are visible. The PCA procedure now finds the best line through these data points — but “best” here means something different from regression.

Why PCA differs from regression

In Ordinary Least Squares (OLS) regression, residuals are measured perpendicular to the x-axis. In PCA, residuals are measured perpendicular to the line itself. This is a crucial difference: PCA treats all variables symmetrically — there is no dependent variable — and finds the direction that minimises the orthogonal distances from points to the line.

For the walkthrough we keep only Feat1 and Feat2:

Language  Feat1  Feat2
Lang1        10    6.0
Lang2        11    4.0
Lang3         8    5.0
Lang4         3    3.0
Lang5         1    2.8
Lang6         2    1.0
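The difference between the two kinds of residuals can be checked numerically. The sketch below uses the Feat1 and Feat2 values from the table and compares the slope of an OLS regression of Feat2 on Feat1 with the slope implied by PC1; the two differ because they minimise different distances.

```r
# OLS minimises vertical residuals; PCA minimises orthogonal residuals.
# The resulting "best lines" therefore have different slopes.
feat1 <- c(10, 11, 8, 3, 1, 2)
feat2 <- c(6, 4, 5, 3, 2.8, 1)
ols_slope <- unname(coef(lm(feat2 ~ feat1))[2])
# Unscaled PCA here, so both slopes are in the original units
pc1 <- prcomp(cbind(feat1, feat2), center = TRUE, scale. = FALSE)$rotation[, 1]
pca_slope <- unname(pc1["feat2"] / pc1["feat1"])
c(OLS = ols_slope, PCA = pca_slope)
```

The PC1 slope is slightly steeper than the OLS slope for these data, because OLS only penalises deviation in Feat2 while PCA treats both features symmetrically.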

Step 1: Centre and scale

Always centre and scale your data before PCA

If you do not scale your variables, the components will be driven by whichever variable has the largest raw scale, not because it is more important but simply because its numbers are bigger. Centering shifts each variable so that its mean is zero; scaling divides each variable by its standard deviation so that every variable has a standard deviation of one. This changes the units, not the information: the configuration of the data points relative to one another is preserved.

After centering and scaling, the origin (0, 0) sits at the mean of the data.
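A quick check using the Feat1 and Feat2 values from the table above: after scale(), each column has mean 0 and standard deviation 1.

```r
# Centre and scale the two features; only the units change, not the
# relative configuration of the data points.
m <- cbind(Feat1 = c(10, 11, 8, 3, 1, 2),
           Feat2 = c(6, 4, 5, 3, 2.8, 1))
m_scaled <- scale(m, center = TRUE, scale = TRUE)
colMeans(m_scaled)       # effectively 0 for both columns
apply(m_scaled, 2, sd)   # exactly 1 for both columns
```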

Step 2: Find PC1

PC1 is the line through the origin that maximises the spread of the projected data points — equivalently, it minimises the sum of squared orthogonal distances between each point and the line. The direction of this line is the eigenvector of PC1; the amount of variance it captures is the eigenvalue.

The orange dashed line is PC1. The blue lines are the orthogonal distances from each point to PC1; the green points are the projections. PCA minimises the sum of squared blue line lengths — equivalently, it maximises the spread of the green projection points.

Step 3: Find PC2

PC2 is simply the line perpendicular to PC1 that also passes through the origin. In 2D there is only one such direction; in higher dimensions, PC2 is the direction orthogonal to PC1 that captures the most remaining variance.

Eigenvectors and eigenvalues

Eigenvector vs. Eigenvalue — two different things

Eigenvector: The direction of a principal component. For a 2-variable PCA, the eigenvector for PC1 specifies how much each original variable contributes to PC1; these weights are the loading scores. For example, if the eigenvector is (0.728, 0.685), then PC1 is a mix of Feat1 (weight 0.728) and Feat2 (weight 0.685). Note that these are weights rather than percentages; their squared values sum to 1.

Eigenvalue: The amount of variance captured by a principal component. The eigenvalue of PC1 is always at least as large as that of PC2, the eigenvalue of PC2 at least as large as that of PC3, and so on. The proportion of total variance explained by PCk is its eigenvalue divided by the sum of all eigenvalues.
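Both quantities can be computed directly with eigen(). A minimal sketch using the two-feature data from the table above (no numbers hard-coded, since they depend on the data):

```r
# Eigen-decomposition of the correlation matrix: eigenvectors give the
# component directions (loadings), eigenvalues the variance each captures.
m <- cbind(Feat1 = c(10, 11, 8, 3, 1, 2),
           Feat2 = c(6, 4, 5, 3, 2.8, 1))
e <- eigen(cor(m))
e$vectors                  # directions of PC1 and PC2
e$values                   # variance captured by each component
e$values / sum(e$values)   # proportion of variance explained

# This matches prcomp() on the scaled data: sdev^2 equals the eigenvalues
prcomp(m, scale. = TRUE)$sdev^2
```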

Step 4: Visualise — rotate to PC axes

The final PCA plot is simply the data projected onto the PC axes. Rotating so that PC1 is horizontal and PC2 is vertical gives the familiar PCA scatter plot:

We can verify this against prcomp():

Code
df <- data.frame(Feat1 = pcadat_scaled$Feat1,
                 Feat2 = pcadat_scaled$Feat2)
pcaex <- prcomp(df, center = TRUE, scale. = TRUE)
fviz_pca_ind(pcaex, label = "") +
  coord_cartesian(xlim = c(-2, 2), ylim = c(-2, 2)) +
  ggtitle("PCA plot") +
  theme(aspect.ratio = 1)

How is explained variance calculated?

Each PC has an associated eigenvalue representing the variance it captures. The proportion of variance explained by PCk is simply its eigenvalue divided by the total sum of all eigenvalues.

In the covariance matrix of the PCA-rotated data, the off-diagonal elements are all zero (the components are uncorrelated by construction), and the diagonal elements are the eigenvalues of the components. The total variance is preserved:

Code
sdpc1 <- sd(pcaex$x[, 1])
sdpc2 <- sd(pcaex$x[, 2])
# Proportion of variance explained by PC1
sdpc1^2 / (sdpc1^2 + sdpc2^2)
[1] 0.8978
Code
# Proportion of variance explained by PC2
sdpc2^2 / (sdpc1^2 + sdpc2^2)
[1] 0.1022

Cross-check against prcomp:

Code
summary(pcaex)
Importance of components:
                         PC1   PC2
Standard deviation     1.340 0.452
Proportion of Variance 0.898 0.102
Cumulative Proportion  0.898 1.000

Example 1: PCA on simulated language data

This example shows how to use prcomp(), generate a scree plot, create a PCA scatter plot, and extract the features that contribute most to PC1.

We simulate data representing 100 features measured across 10 languages (5 Indo-European, 5 Austronesian). Languages within each family are expected to be more similar to each other.

Code
set.seed(123)
pcadat <- matrix(nrow = 100, ncol = 10)
colnames(pcadat) <- c(paste0("indoeu", 1:5), paste0("austro", 1:5))
rownames(pcadat) <- paste0("feature", 1:100)

for (i in 1:100) {
  indoeu.values <- rpois(5, lambda = sample(10:1000, 1))
  austro.values <- rpois(5, lambda = sample(10:1000, 1))
  pcadat[i, ]   <- c(indoeu.values, austro.values)
}

head(pcadat)
         indoeu1 indoeu2 indoeu3 indoeu4 indoeu5 austro1 austro2 austro3
feature1     440     448     389     426     459     386     408     349
feature2    1015    1005     964    1013     935     865     873     849
feature3     834     797     820     821     826     863     884     866
feature4     778     751     789     725     767      25      21      23
feature5     552     521     509     520     551     475     478     456
feature6     378     386     368     326     356     769     752     734
         austro4 austro5
feature1     406     390
feature2     877     829
feature3     858     832
feature4      21      21
feature5     492     495
feature6     757     749

We transpose the matrix (features must be columns) and run PCA with scaling:

Code
pca <- prcomp(t(pcadat), scale. = TRUE)

Scree plot — how much variance does each component explain?

Code
pca.var     <- pca$sdev^2
pca.var.per <- round(pca.var / sum(pca.var) * 100, 1)

data.frame(
  Component = factor(paste0("PC", 1:10), levels = paste0("PC", 1:10)),
  Percent   = round(pca.var / sum(pca.var) * 100, 2)
) |>
  dplyr::mutate(Text = paste0(Percent, "%")) |>
  ggplot(aes(x = Component, y = Percent, label = Text)) +
  geom_bar(stat = "identity", fill = "gray60") +
  geom_text(vjust = -1.6, size = 3) +
  ggtitle("Scree Plot") +
  labs(x = "Principal Components", y = "Percent Variation") +
  theme_bw() +
  coord_cartesian(ylim = c(0, 100))

PC1 dominates, explaining the large majority of variance — one component is enough to separate the two language families. The PCA plot confirms this:

Code
data.frame(
  Sample = rownames(pca$x),
  X      = pca$x[, 1],
  Y      = pca$x[, 2]
) |>
  ggplot(aes(x = X, y = Y, label = Sample)) +
  geom_text() +
  xlab(paste0("PC1 (", pca.var.per[1], "%)")) +
  ylab(paste0("PC2 (", pca.var.per[2], "%)")) +
  theme_bw() +
  ggtitle("PCA: Indo-European vs. Austronesian") +
  coord_cartesian(xlim = c(-11, 11))

Which features drive PC1? Extract the five features with the highest (absolute) loadings:

Code
loading_scores      <- pca$rotation[, 1]
feature_scores      <- abs(loading_scores)
feature_score_ranked <- sort(feature_scores, decreasing = TRUE)
top_5_features      <- names(feature_score_ranked[1:5])
top_5_features
[1] "feature71" "feature9"  "feature82" "feature30" "feature79"
Code
# Show signed scores (direction matters: positive vs. negative loading)
pca$rotation[top_5_features, 1]
feature71  feature9 feature82 feature30 feature79 
   0.1055   -0.1055   -0.1055    0.1055   -0.1055 

Example 2: PCA for survey item validation

Here we use PCA to check whether 5 Likert-scale items all capture the same underlying construct (outgoingness / extraversion). If they do, PC1 should explain a very large proportion of variance, and all five items should load similarly on PC1.

Code
surveydata <- base::readRDS("tutorials/dimensionredux_tutorial/data/sud.rda")
flextable::flextable(head(surveydata, 10))

Respondent      Outgoing (Q01-Q05)   Intelligence (Q06-Q10)   Attitude (Q11-Q15)
Respondent_01   4 5 4 4 5            2 3 3 2 2                3 3 2 3 3
Respondent_02   5 4 5 4 4            2 2 2 1 2                4 4 4 5 4
Respondent_03   5 4 4 5 5            2 1 1 2 2                5 4 4 4 4
Respondent_04   5 5 5 4 5            1 1 1 1 1                5 4 5 5 5
Respondent_05   4 5 4 5 5            2 2 1 2 1                4 5 4 5 5
Respondent_06   5 5 5 5 4            5 4 5 2 2                1 2 1 2 1
Respondent_07   4 5 4 5 5            4 5 4 4 5                2 1 1 2 1
Respondent_08   4 4 5 4 5            5 4 5 4 5                1 2 1 1 2
Respondent_09   5 5 4 4 4            5 5 4 4 5                1 2 2 1 2
Respondent_10   4 5 5 4 4            4 4 5 4 5                2 1 2 2 1

Each group of five values corresponds to the items in question-number order (e.g. Q01 to Q05 for the Outgoing block).

Code
pca_res2 <- princomp(surveydata[c(
  "Q01_Outgoing", "Q02_Outgoing", "Q03_Outgoing",
  "Q04_Outgoing", "Q05_Outgoing"
)])
summary(pca_res2)
Importance of components:
                       Comp.1  Comp.2  Comp.3  Comp.4   Comp.5
Standard deviation      3.232 0.62657 0.50634 0.44586 0.332592
Proportion of Variance  0.916 0.03443 0.02248 0.01743 0.009701
Cumulative Proportion   0.916 0.95038 0.97287 0.99030 1.000000

PC1 alone explains over 91% of the variance — a very strong result suggesting the five items reliably reflect a single underlying dimension. The scree plot makes this visual:

Code
pca2.var <- pca_res2$sdev^2

data.frame(
  Component = factor(paste0("PC", 1:5), levels = paste0("PC", 1:5)),
  Percent   = round(pca2.var / sum(pca2.var) * 100, 2)
) |>
  dplyr::mutate(Text = paste0(Percent, "%")) |>
  ggplot(aes(x = Component, y = Percent, label = Text)) +
  geom_bar(stat = "identity", fill = "gray60") +
  geom_text(vjust = -1.6, size = 3) +
  ggtitle("Scree Plot: Outgoingness items") +
  labs(x = "Principal Components", y = "Percent Variation") +
  theme_bw() +
  coord_cartesian(ylim = c(0, 100))

We can now visualise individual respondents’ scores to see who is more or less outgoing:

Code
pca_res2$scores |>
  as.data.frame() |>
  dplyr::rename(PC1 = 1, PC2 = 2, PC3 = 3, PC4 = 4, PC5 = 5) |>
  dplyr::mutate(
    Participants = paste0("P", 1:20),
    clr          = ifelse(PC1 > 0, 1, 0)
  ) |>
  ggplot(aes(x = PC1, y = PC2, color = clr, label = Participants)) +
  geom_point(size = 2) +
  geom_text(size = 3, nudge_x = -0.2, check_overlap = TRUE) +
  theme_bw() +
  scale_color_viridis_b() +
  theme(legend.position = "none") +
  ggtitle("PC1 distinguishes more outgoing from less outgoing participants")

Example 3: PCA on biopsy data

This example applies PCA to the biopsy dataset from the MASS package. The data contains 9 cytological measurements for biopsy samples, each labelled as either benign or malignant. PCA can reveal whether the measurements naturally separate the two classes.

Code
data(biopsy)
biopsy_nona <- biopsy |>
  tidyr::drop_na() |>
  dplyr::rename(
    ClumpSize     = V1, CellSize     = V2, CellShape  = V3,
    Adhesion      = V4, EpithelialSize = V5, Nuclei   = V6,
    Chromatin     = V7, Nucleoli     = V8, Mitoses    = V9
  )
flextable::flextable(head(biopsy_nona, 10))

ID       ClumpSize CellSize CellShape Adhesion EpithelialSize Nuclei Chromatin Nucleoli Mitoses class
1000025          5        1         1        1              2      1         3        1       1 benign
1002945          5        4         4        5              7     10         3        2       1 benign
1015425          3        1         1        1              2      2         3        1       1 benign
1016277          6        8         8        1              3      4         3        7       1 benign
1017023          4        1         1        3              2      1         3        1       1 benign
1017122          8       10        10        8              7     10         9        7       1 malignant
1018099          1        1         1        1              2     10         3        1       1 benign
1018561          2        1         2        1              2      1         3        1       1 benign
1033078          2        1         1        1              2      1         1        1       5 benign
1033078          4        2         1        1              2      1         2        1       1 benign

Code
biopsy_num <- biopsy_nona |> dplyr::select_if(is.numeric)
biopsy_pca <- prcomp(biopsy_num, scale. = TRUE)
summary(biopsy_pca)
Importance of components:
                         PC1    PC2    PC3    PC4    PC5    PC6    PC7   PC8
Standard deviation     2.429 0.8809 0.7343 0.6780 0.6167 0.5494 0.5426 0.511
Proportion of Variance 0.655 0.0862 0.0599 0.0511 0.0423 0.0335 0.0327 0.029
Cumulative Proportion  0.655 0.7417 0.8016 0.8527 0.8950 0.9285 0.9612 0.990
                           PC9
Standard deviation     0.29729
Proportion of Variance 0.00982
Cumulative Proportion  1.00000

The scree plot shows that PC1 alone explains approximately 65% of the variance:

Code
factoextra::fviz_eig(biopsy_pca,
  addlabels = TRUE, ylim = c(0, 80),
  barfill = "gray50", barcolor = "gray20"
)
Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
Ignoring empty aesthetic: `width`.

The biplot overlays both data points (coloured by diagnosis) and variable arrows, allowing us to see simultaneously how samples cluster and which variables drive the separation:

Code
fviz_pca_biplot(biopsy_pca,
  geom    = "point",
  label   = "var",
  habillage = biopsy_nona$class,
  col.var = "black"
) +
  ggplot2::scale_color_manual(values = c("orange", "purple")) +
  ggtitle("Biplot: biopsy data coloured by class")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.

The biplot can be interpreted as follows:

  • Positively correlated variables cluster together (their arrows point in the same direction)
  • Negatively correlated variables point in opposite directions from the origin
  • Variables further from the origin are better represented on the plot (higher quality of representation)
  • The two classes (benign in orange, malignant in purple) separate clearly along PC1

Why do nearly all variable arrows cluster together? Because almost all variables are positively correlated — samples with larger cell sizes tend to have larger nuclei, more adhesion, etc. The correlation matrix confirms this:
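The correlation table below can be regenerated as follows. The sketch rebuilds biopsy_num from MASS::biopsy so the chunk runs on its own; in the session above you can simply call cor() on the existing biopsy_num object.

```r
# Correlation matrix of the nine cytological measurements
library(MASS)
biopsy_num <- na.omit(biopsy)[, paste0("V", 1:9)]
colnames(biopsy_num) <- c("ClumpSize", "CellSize", "CellShape", "Adhesion",
                          "EpithelialSize", "Nuclei", "Chromatin",
                          "Nucleoli", "Mitoses")
round(cor(biopsy_num), 2)
```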

                ClumpSize CellSize CellShape Adhesion EpithelialSize Nuclei Chromatin Nucleoli Mitoses
ClumpSize            1.00     0.64      0.65     0.49           0.52   0.59      0.55     0.53    0.35
CellSize             0.64     1.00      0.91     0.71           0.75   0.69      0.76     0.72    0.46
CellShape            0.65     0.91      1.00     0.69           0.72   0.71      0.74     0.72    0.44
Adhesion             0.49     0.71      0.69     1.00           0.59   0.67      0.67     0.60    0.42
EpithelialSize       0.52     0.75      0.72     0.59           1.00   0.59      0.62     0.63    0.48
Nuclei               0.59     0.69      0.71     0.67           0.59   1.00      0.68     0.58    0.34
Chromatin            0.55     0.76      0.74     0.67           0.62   0.68      1.00     0.67    0.35
Nucleoli             0.53     0.72      0.72     0.60           0.63   0.58      0.67     1.00    0.43
Mitoses              0.35     0.46      0.44     0.42           0.48   0.34      0.35     0.43    1.00

Mitoses stands out with lower correlations to the other variables, which is why its arrow diverges from the main cluster in the biplot.

Exercises: PCA

Q2. After running a PCA on a corpus of 500 texts described by 50 frequency-based features, a researcher finds that PC1 explains 72% of the variance and PC2 explains 11%. She decides to retain only PC1 for downstream analysis. Is this decision justified? What information does she lose?






Multidimensional Scaling

Section Overview

What you will learn: What MDS is and how it differs from PCA; metric vs. non-metric MDS; how to compute a distance matrix and run MDS in R; and how to interpret an MDS plot

Multidimensional Scaling (MDS) is used to visualise high-dimensional data in a lower-dimensional space — usually 2D or 3D — while preserving pairwise distances or similarities between data points (Davison 1983). MDS aims to represent the similarity structure of a set of items as accurately as possible.

How MDS differs from PCA

MDS and PCA both result in a low-dimensional plot, but they start from different inputs:

  • PCA starts from the covariance matrix of the variables and finds linear combinations that maximise variance
  • MDS starts from a distance matrix between observations and finds a configuration in low-dimensional space where the distances between points match the original distances as closely as possible

MDS is often used when:

  • You have a precomputed distance or dissimilarity matrix (rather than raw variable measurements)
  • Your research question is explicitly about similarities between cases (e.g. “which languages are most similar to each other?”)
  • You want to visualise the proximity structure without assuming linear structure in the data

Metric vs. non-metric MDS

Metric (classical) MDS — also known as Principal Coordinate Analysis (PCoA) — assumes that the distance matrix reflects true metric distances and aims to reproduce those distances exactly. This is implemented in R via cmdscale().

Non-metric MDS — preserves the rank order of distances rather than the distances themselves. It is appropriate when your distances are ordinal (e.g. similarity ratings) rather than true numerical distances. This is implemented via isoMDS() in the MASS package.

For most corpus linguistics and survey applications, metric MDS is the natural starting point.
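For completeness, here is a sketch of non-metric MDS with MASS::isoMDS on toy data (not the tutorial's survey data). The workflow mirrors cmdscale(), but the fit optimises rank agreement between distances and reports a stress value.

```r
# Non-metric MDS: preserves the rank order of distances, not the
# distances themselves. Stress (in percent) measures the remaining misfit.
library(MASS)
set.seed(7)
toy  <- matrix(rnorm(40), nrow = 8,
               dimnames = list(paste0("item", 1:8), NULL))
d    <- dist(toy)                      # Euclidean distance matrix
nmds <- isoMDS(d, k = 2, trace = FALSE)
nmds$points   # 2D coordinates, one row per item
nmds$stress   # lower is better; values below roughly 20 are usually acceptable
```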

Example: MDS on survey data

We continue with the same survey data set (15 items assessing three constructs: outgoingness, intelligence, and attitude across 20 participants). MDS will reveal whether the 15 items group into three interpretable clusters.

Code
surveydata <- base::readRDS("tutorials/dimensionredux_tutorial/data/sud.rda")
report(surveydata)
The data contains 20 observations of the following 16 variables:

  - Respondent: 20 entries, such as Respondent_01 (n = 1); Respondent_02 (n = 1);
Respondent_03 (n = 1) and 17 others (0 missing)
  - Q01_Outgoing: n = 20, Mean = 3.15, SD = 1.53, Median = 3.50, MAD = 2.22,
range: [1, 5], Skewness = -0.18, Kurtosis = -1.52, 0 missing
  - Q02_Outgoing: n = 20, Mean = 3.25, SD = 1.59, Median = 3.50, MAD = 2.22,
range: [1, 5], Skewness = -0.10, Kurtosis = -1.73, 0 missing
  - Q03_Outgoing: n = 20, Mean = 3.15, SD = 1.50, Median = 3.50, MAD = 2.22,
range: [1, 5], Skewness = -0.07, Kurtosis = -1.58, 0 missing
  - Q04_Outgoing: n = 20, Mean = 3.00, SD = 1.56, Median = 3.50, MAD = 2.22,
range: [1, 5], Skewness = -0.09, Kurtosis = -1.64, 0 missing
  - Q05_Outgoing: n = 20, Mean = 3.20, SD = 1.58, Median = 3.50, MAD = 2.22,
range: [1, 5], Skewness = -0.19, Kurtosis = -1.59, 0 missing
  - Q06_Intelligence: n = 20, Mean = 2.90, SD = 1.45, Median = 2.00, MAD = 1.48,
range: [1, 5], Skewness = 0.31, Kurtosis = -1.43, 0 missing
  - Q07_Intelligence: n = 20, Mean = 3.05, SD = 1.39, Median = 3.00, MAD = 1.48,
range: [1, 5], Skewness = 0.03, Kurtosis = -1.24, 0 missing
  - Q08_Intelligence: n = 20, Mean = 2.95, SD = 1.54, Median = 2.50, MAD = 2.22,
range: [1, 5], Skewness = 0.19, Kurtosis = -1.54, 0 missing
  - Q09_Intelligence: n = 20, Mean = 2.80, SD = 1.51, Median = 2.00, MAD = 1.48,
range: [1, 5], Skewness = 0.38, Kurtosis = -1.48, 0 missing
  - Q10_Intelligence: n = 20, Mean = 3.60, SD = 1.50, Median = 4.00, MAD = 1.48,
range: [1, 5], Skewness = -0.58, Kurtosis = -1.33, 0 missing
  - Q11_Attitude: n = 20, Mean = 2.75, SD = 1.62, Median = 2.50, MAD = 2.22,
range: [1, 5], Skewness = 0.20, Kurtosis = -1.65, 0 missing
  - Q12_Attitude: n = 20, Mean = 2.95, SD = 1.43, Median = 2.50, MAD = 2.22,
range: [1, 5], Skewness = 0.22, Kurtosis = -1.40, 0 missing
  - Q13_Attitude: n = 20, Mean = 2.75, SD = 1.65, Median = 2.00, MAD = 1.48,
range: [1, 5], Skewness = 0.21, Kurtosis = -1.77, 0 missing
  - Q14_Attitude: n = 20, Mean = 2.95, SD = 1.54, Median = 2.50, MAD = 2.22,
range: [1, 5], Skewness = 0.19, Kurtosis = -1.54, 0 missing
  - Q15_Attitude: n = 20, Mean = 2.95, SD = 1.70, Median = 2.50, MAD = 2.22,
range: [1, 5], Skewness = 0.09, Kurtosis = -1.82, 0 missing

Remove the non-numeric respondent identifier:

Code
surveydata <- surveydata |> dplyr::select(-Respondent)
str(surveydata)
'data.frame':   20 obs. of  15 variables:
 $ Q01_Outgoing    : int  4 5 5 5 4 5 4 4 5 4 ...
 $ Q02_Outgoing    : int  5 4 4 5 5 5 5 4 5 5 ...
 $ Q03_Outgoing    : int  4 5 4 5 4 5 4 5 4 5 ...
 $ Q04_Outgoing    : int  4 4 5 4 5 5 5 4 4 4 ...
 $ Q05_Outgoing    : int  5 4 5 5 5 4 5 5 4 4 ...
 $ Q06_Intelligence: int  2 2 2 1 2 5 4 5 5 4 ...
 $ Q07_Intelligence: int  3 2 1 1 2 4 5 4 5 4 ...
 $ Q08_Intelligence: int  3 2 1 1 1 5 4 5 4 5 ...
 $ Q09_Intelligence: int  2 1 2 1 2 2 4 4 4 4 ...
 $ Q10_Intelligence: int  2 2 2 1 1 2 5 5 5 5 ...
 $ Q11_Attitude    : int  3 4 5 5 4 1 2 1 1 2 ...
 $ Q12_Attitude    : int  3 4 4 4 5 2 1 2 2 1 ...
 $ Q13_Attitude    : int  2 4 4 5 4 1 1 1 2 2 ...
 $ Q14_Attitude    : int  3 5 4 5 5 2 2 1 1 2 ...
 $ Q15_Attitude    : int  3 4 4 5 5 1 1 2 2 1 ...

Compute the Euclidean distance matrix. We transpose, scale, and centre for the same reasons as in PCA — to ensure all items are on a comparable scale before computing distances:

Code
survey_dist <- dist(
  scale(t(surveydata), center = TRUE, scale = TRUE),
  method = "euclidean"
)

Run classical MDS (an eigenvalue decomposition of the doubly centred distance matrix):

Code
mds.obj <- cmdscale(
  survey_dist,
  eig   = TRUE,  # return the eigenvalues
  x.ret = TRUE   # return the doubly centred distance matrix
)

Calculate the percentage of variation explained by each MDS axis:

Code
mds.var.per <- round(mds.obj$eig / sum(mds.obj$eig) * 100, 1)
mds.var.per
 [1] 54.8 29.0  4.8  2.7  2.1  1.7  1.5  1.0  0.9  0.6  0.4  0.3  0.1  0.0  0.0

Build and plot the MDS map:

Code
mds.values <- mds.obj$points
mds.data   <- data.frame(
  Sample = rownames(mds.values),
  X      = mds.values[, 1],
  Y      = mds.values[, 2]
)

ggplot(data = mds.data, aes(x = X, y = Y, label = Sample)) +
  geom_text() +
  theme_bw() +
  xlab(paste0("MDS1 — ", mds.var.per[1], "%")) +
  ylab(paste0("MDS2 — ", mds.var.per[2], "%")) +
  ggtitle("MDS plot using Euclidean distance") +
  coord_cartesian(xlim = c(-7, 7), ylim = c(-4, 4))

The plot shows three well-separated clusters, corresponding to the three constructs assessed by the survey. Items within each construct are placed close together; items from different constructs are far apart. Notably, Q10_Intelligence sits apart from the other intelligence items, suggesting it may not measure the same construct as the others — a potentially important finding for survey refinement.

Exercises: MDS

Q3. In the MDS plot above, Q10_Intelligence is placed noticeably further from the other intelligence items than any item is from its own group. What does this tell us, and what should the researcher do?






Factor Analysis

Section Overview

What you will learn: What Factor Analysis is and how it differs from PCA; exploratory (EFA) vs. confirmatory (CFA) factor analysis; how to determine the number of factors using scree plots and the Kaiser criterion; how to implement EFA in R with factanal(); and how to interpret uniqueness scores, factor loadings, and the significance test

Factor Analysis is used to identify underlying latent factors that explain the observed correlations among variables in a data set (Kim and Mueller 1978; Field, Miles, and Field 2012). It reduces complexity by identifying a smaller number of latent variables (factors) that account for the shared variance among observed variables. Ideally, variables representing the same underlying factor should be highly correlated with each other and uncorrelated with variables representing other factors.

EFA vs. CFA

Exploratory Factor Analysis (EFA) — used when you do not know in advance how many factors are present or which variables load on which factors. EFA lets the data determine the factor structure. This is what we cover in this tutorial.

Confirmatory Factor Analysis (CFA) — used when you have a prior hypothesis about the factor structure and want to test whether your data are consistent with that structure. CFA is implemented using Structural Equation Modelling (SEM) in R via the lavaan package. See the Structural Equation Modelling tutorial for details.

Rotation in EFA

After extracting an initial factor solution, EFA applies a rotation step to transform the factor structure into a more interpretable form. Rotation aims to achieve simple structure: each variable loads highly on only one factor, and each factor is defined by a distinct subset of variables.

  • Varimax — an orthogonal rotation that assumes the factors are uncorrelated. It maximises the variance of squared loadings within each factor, pushing loadings toward 0 or ±1. This is the default and most widely used rotation method (Field, Miles, and Field 2012, 755).
  • Oblimin / Promax — oblique rotations that allow factors to be correlated. More appropriate when you expect the underlying constructs to be related to each other (e.g. different aspects of language proficiency).
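The rotation method is selected via the rotation argument of factanal(). As a minimal sketch, the snippet below contrasts the orthogonal "varimax" and oblique "promax" solutions on simulated data with two correlated latent factors; the data and variable names are illustrative stand-ins, not the tutorial's surveydata.

```r
# Sketch: orthogonal (varimax) vs. oblique (promax) rotation in factanal().
# Simulated stand-in data: two correlated latent factors, three items each.
set.seed(42)
n  <- 300
f1 <- rnorm(n)
f2 <- 0.5 * f1 + rnorm(n)                     # factors correlate by design
sim <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.5), a2 = f1 + rnorm(n, sd = 0.5),
  a3 = f1 + rnorm(n, sd = 0.5),
  b1 = f2 + rnorm(n, sd = 0.5), b2 = f2 + rnorm(n, sd = 0.5),
  b3 = f2 + rnorm(n, sd = 0.5)
)

fa_varimax <- factanal(sim, factors = 2, rotation = "varimax")
fa_promax  <- factanal(sim, factors = 2, rotation = "promax")

# With correlated constructs, the oblique solution usually shows a cleaner
# simple structure; compare the two loading matrices:
print(fa_varimax$loadings, cutoff = 0.2)
print(fa_promax$loadings,  cutoff = 0.2)
```

Because promax allows the factors to correlate, its loadings are regression-like weights rather than plain correlations, but the grouping of items onto factors is usually easier to read.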

Determining the number of factors

Before running EFA, you need to decide how many factors to extract. Two common approaches:

  1. Scree plot — plot eigenvalues in descending order and look for a “point of inflexion” (elbow) where the slope flattens. The number of factors is the number of eigenvalues above the elbow.

  2. Kaiser criterion (Kaiser 1960) — retain all factors with eigenvalues greater than 1. The rationale is that a factor should explain at least as much variance as a single variable. Note that Field, Miles, and Field (2012, 763) and others caution that the Kaiser criterion can over-extract factors; use it alongside the scree plot.
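The Kaiser criterion reduces to a one-liner once the eigenvalues are available. A minimal sketch, using simulated stand-in data with three built-in constructs rather than the tutorial's surveydata:

```r
# Sketch: counting factors with the Kaiser criterion (eigenvalue > 1).
# Simulated stand-in: 15 items generated from three latent constructs.
set.seed(1)
n <- 200
g <- replicate(3, rnorm(n))                         # three latent constructs
stand_in <- as.data.frame(
  g[, rep(1:3, each = 5)] + matrix(rnorm(n * 15, sd = 0.6), nrow = n)
)

ev <- eigen(cor(stand_in))$values
sum(ev > 1)   # number of factors suggested by the Kaiser criterion
```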

Example: EFA on survey data

We continue with the same 15-item survey data set. The scree plot helps us decide how many factors to extract:

Code
eigenvalues <- eigen(cor(surveydata))

eigenvalues$values |>
  as.data.frame() |>
  dplyr::rename(Eigenvalues = 1) |>
  dplyr::mutate(Factors = seq_len(dplyr::n())) |>
  ggplot(aes(x = Factors, y = Eigenvalues, label = round(Eigenvalues, 2))) +
  geom_bar(stat = "identity", fill = "lightgray") +
  geom_line(color = "red", linewidth = 1) +
  geom_point() +
  geom_text(vjust = -1.2) +
  geom_hline(yintercept = 1, color = "blue", linetype = "dotted") +
  theme_bw() +
  coord_cartesian(ylim = c(0, 10)) +
  ggtitle("Scree plot: 15-item survey")

There is a clear elbow at 3 factors — eigenvalues drop sharply after the third. The blue dotted line shows the Kaiser criterion (eigenvalue = 1): again suggesting 3 factors. We proceed with 3.

Code
factoranalysis <- factanal(surveydata, 3)
print(factoranalysis,
  digits = 2,
  cutoff = 0.2,  # suppress loadings below .2 for readability
  sort   = TRUE
)

Call:
factanal(x = surveydata, factors = 3)

Uniquenesses:
    Q01_Outgoing     Q02_Outgoing     Q03_Outgoing     Q04_Outgoing 
            0.09             0.06             0.12             0.07 
    Q05_Outgoing Q06_Intelligence Q07_Intelligence Q08_Intelligence 
            0.14             0.10             0.13             0.10 
Q09_Intelligence Q10_Intelligence     Q11_Attitude     Q12_Attitude 
            0.28             0.41             0.08             0.14 
    Q13_Attitude     Q14_Attitude     Q15_Attitude 
            0.04             0.09             0.06 

Loadings:
                 Factor1 Factor2 Factor3
Q06_Intelligence -0.82    0.25    0.41  
Q07_Intelligence -0.80            0.47  
Q08_Intelligence -0.85            0.42  
Q09_Intelligence -0.79            0.29  
Q11_Attitude      0.96                  
Q12_Attitude      0.92                  
Q13_Attitude      0.97                  
Q14_Attitude      0.95                  
Q15_Attitude      0.96                  
Q01_Outgoing              0.94          
Q02_Outgoing              0.96          
Q03_Outgoing              0.93          
Q04_Outgoing              0.96          
Q05_Outgoing              0.92          
Q10_Intelligence -0.22   -0.46    0.57  

               Factor1 Factor2 Factor3
SS loadings       7.29    4.78    1.02
Proportion Var    0.49    0.32    0.07
Cumulative Var    0.49    0.80    0.87

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 62.79 on 63 degrees of freedom.
The p-value is 0.484 

Interpreting the output

Uniqueness scores — the proportion of each variable’s variance not explained by the common factors. A high uniqueness score (close to 1) means the variable shares little variance with any of the factors and may be a poor indicator of any latent construct. Note that Q10_Intelligence has the highest uniqueness score (0.41), consistent with what the MDS plot showed.
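To see how uniqueness behaves in practice, the sketch below fits a one-factor model to simulated data containing one deliberately noisy item; all names are illustrative, not the tutorial's survey items.

```r
# Sketch: a high-uniqueness item shares little variance with the factor.
# Four clean indicators plus one item only weakly tied to the construct.
set.seed(7)
n <- 300
f <- rnorm(n)
d <- data.frame(
  good1 = f + rnorm(n, sd = 0.4),
  good2 = f + rnorm(n, sd = 0.4),
  good3 = f + rnorm(n, sd = 0.4),
  good4 = f + rnorm(n, sd = 0.4),
  noisy = 0.2 * f + rnorm(n)        # weak relation to the latent factor
)
fa <- factanal(d, factors = 1)
round(fa$uniquenesses, 2)   # 'noisy' should sit close to 1
```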

Factor loadings — correlations between each observed variable and each latent factor. Loadings indicate how strongly each variable is associated with each factor. Conventional thresholds (Field, Miles, and Field 2012, 762):

  • < 0.4: low (weak association)
  • 0.4–0.6: acceptable
  • > 0.6: high (strong association)

SS loadings — the sum-of-squared loadings for each factor, which represents the variance the factor explains across all variables. Each factor should ideally have SS loadings > 1. If a factor falls below 1, it may not be contributing meaningfully and could be removed.
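The SS loadings row can be reproduced by hand: it is the column sums of squared loadings, and dividing by the number of variables gives Proportion Var. A sketch on simulated two-construct data (data and names are illustrative, not the tutorial's surveydata):

```r
# Sketch: recomputing SS loadings and Proportion Var from the loadings.
set.seed(3)
n  <- 300
f1 <- rnorm(n); f2 <- rnorm(n)
d <- data.frame(
  x1 = f1 + rnorm(n, sd = 0.5), x2 = f1 + rnorm(n, sd = 0.5),
  x3 = f1 + rnorm(n, sd = 0.5),
  y1 = f2 + rnorm(n, sd = 0.5), y2 = f2 + rnorm(n, sd = 0.5),
  y3 = f2 + rnorm(n, sd = 0.5)
)
fa <- factanal(d, factors = 2)
L  <- unclass(fa$loadings)   # strip the "loadings" print class
ss <- colSums(L^2)           # SS loadings, as printed by factanal()
round(ss, 2)
round(ss / ncol(d), 2)       # Proportion Var
```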

The p-value — tests whether the current factor solution fits the data. A non-significant p-value (p > .05) suggests the number of factors is sufficient. A significant p-value suggests you may need more factors — though with small samples this test has low power.
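One practical use of this test is to increase the number of factors until the solution is no longer rejected. A sketch on simulated three-construct data (an illustrative stand-in, not the tutorial's surveydata):

```r
# Sketch: stepping up the number of factors until p > .05.
set.seed(11)
n <- 400
f <- replicate(3, rnorm(n))                       # three latent constructs
d <- as.data.frame(
  f[, rep(1:3, each = 3)] + matrix(rnorm(n * 9, sd = 0.6), nrow = n)
)

for (k in 1:3) {
  p <- factanal(d, factors = k)$PVAL
  cat("factors =", k, " p =", signif(p, 3), "\n")
}
# the 1- and 2-factor solutions are rejected; 3 factors typically fit
```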

Visualising the factor structure

Code
fadat <- data.frame(
  Factor1 = factoranalysis$loadings[, 1],
  Factor2 = factoranalysis$loadings[, 2],
  Items   = names(surveydata)
)

fadat |>
  ggplot(aes(x = Factor1, y = Factor2, label = Items, color = Factor1)) +
  geom_text() +
  theme_bw() +
  coord_cartesian(xlim = c(-1.5, 1.5), ylim = c(-1, 1.5)) +
  scale_color_viridis_b() +
  labs(
    x     = "Factor 1 (49%)",
    y     = "Factor 2 (32%)",
    title = "EFA: Factor 1 vs. Factor 2 loadings"
  )

Three groups are clearly visible, corresponding to the three constructs. Q10_Intelligence sits apart from the other intelligence items, further confirming that it is a problematic item that does not reliably reflect the intelligence construct.

Exercises: Factor Analysis

Q4. In the EFA output above, Q05_Outgoing has a uniqueness score of 0.14, while Q10_Intelligence has a uniqueness score of 0.41. What does this difference mean for the measurement quality of each item?






When to Use Which Method?

Section Overview

What you will learn: A structured decision guide for choosing between PCA, MDS, and FA based on your research question, data structure, and goals

Choosing the right dimension reduction method requires thinking carefully about what you want to achieve and what your data look like. The following questions guide the decision:

Decision framework

1. What is your primary goal?

  • Reducing many variables to a smaller set for downstream analysis (regression, clustering, visualisation) → PCA is usually the first choice. It is computationally efficient, makes no model assumptions, and the components are well-defined mathematically.

  • Visualising which observations (not variables) are most similar to each other → MDS is the right tool. You need a distance or dissimilarity matrix between cases.

  • Identifying hidden latent constructs that explain shared variance among variables → FA is appropriate. It explicitly models each variable as a function of common factors plus unique variance.

2. What is your starting point?

  • Raw variable measurements → PCA or FA (compute covariance or correlation matrix internally)
  • A pre-computed distance or dissimilarity matrix → MDS

3. Are your variables correlated?

  • Highly correlated variables suggest a small number of underlying dimensions → FA or PCA will work well
  • Variables with low mutual correlations → dimension reduction may not be appropriate; each variable may capture unique information

4. Do you have a theoretical model?

  • No prior hypothesis about factor structure → EFA
  • Prior hypothesis to test → CFA (via lavaan)
  • No model, just want to compress data → PCA

Summary decision table

  • 50 frequency features, want to find dominant patterns → PCA
  • 20 languages described by typological features, want to map similarity → MDS
  • 30 survey items, want to find underlying attitude dimensions → EFA
  • Hypothesis: survey items reflect 4 specific factors → CFA (lavaan)
  • Want to reduce features before running a regression → PCA
  • Have raw similarity ratings between texts → MDS
  • Want to validate a measurement instrument → EFA, then CFA

Exercises: Method choice

Q5. A researcher has collected responses to 40 Likert-scale items from 300 university students as part of a study on language learning motivation. She suspects that the items reflect 5 underlying motivational dimensions but is not certain. Which method should she use, and why?






Summary

This tutorial has introduced three fundamental dimension reduction methods — PCA, MDS, and Factor Analysis — and shown how to implement each in R.

PCA transforms data into a new set of orthogonal axes (principal components) that capture decreasing proportions of variance. It is the workhorse method for data reduction and visualisation. Key steps: centre and scale the data, run prcomp(), inspect the scree plot, extract loading scores, and plot the first two PCs. The first PC captures the most variance; subsequent PCs each capture less. Eigenvalues quantify variance captured; eigenvectors define the direction of each component.

MDS maps pairwise distances between observations into a low-dimensional visual space. It starts from a distance matrix rather than raw measurements. Classical MDS via cmdscale(), applied to Euclidean distances, is equivalent to PCA on the underlying (centred) data. MDS is the right choice when your research question is about which observations are most similar to each other.
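That equivalence can be checked directly: on centred data, the coordinates returned by cmdscale() from Euclidean distances match the principal component scores from prcomp() up to column signs. A minimal sketch with random data:

```r
# Sketch: classical MDS coordinates equal PCA scores (up to sign) when
# the distances are Euclidean and the data are centred.
set.seed(5)
X   <- scale(matrix(rnorm(50 * 4), nrow = 50), scale = FALSE)  # centred
mds <- cmdscale(dist(X), k = 2)
pcs <- prcomp(X)$x[, 1:2]
max(abs(abs(mds) - abs(pcs)))   # effectively zero: same axes up to sign
```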

Factor Analysis models observed variables as noisy reflections of underlying latent factors. EFA via factanal() extracts factors, applies rotation (Varimax by default) to maximise interpretability, and reports factor loadings (the correlations between observed variables and latent factors) and uniqueness scores (the proportion of variance not explained by any factor). The scree plot and Kaiser criterion help determine the number of factors. Items with high uniqueness scores are poor indicators of any latent construct and should be reviewed.

Choosing between methods comes down to your research goal: data reduction and visualisation → PCA; mapping observation similarity → MDS; identifying latent constructs → EFA → CFA.

Where to go next

Citation & Session Info

Schweinberger, Martin. 2026. Dimension Reduction Methods with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/dimensionredux_tutorial/dimensionredux_tutorial.html (Version 2026.05.01).

@manual{schweinberger2026dimensionredux_tutorial,
  author       = {Schweinberger, Martin},
  title        = {Dimension Reduction Methods with R},
  note         = {tutorials/dimensionredux_tutorial/dimensionredux_tutorial.html},
  year         = {2026},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address      = {Brisbane},
  edition      = {2026.05.01}
}
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. The structure, conceptual explanations, worked examples, R code, and exercises build on an earlier tutorial by Martin Schweinberger. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] checkdown_0.0.13 tufte_0.13       psych_2.4.12     report_0.6.3    
 [5] factoextra_1.0.7 ggplot2_4.0.2    MASS_7.3-61      tidyr_1.3.2     
 [9] stringr_1.6.0    flextable_0.9.11 here_1.0.2       dplyr_1.2.0     

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.56               htmlwidgets_1.6.4      
 [4] insight_1.4.6           ggrepel_0.9.8           lattice_0.22-6         
 [7] vctrs_0.7.2             tools_4.4.2             generics_0.1.4         
[10] parallel_4.4.2          tibble_3.3.1            pkgconfig_2.0.3        
[13] data.table_1.17.0       RColorBrewer_1.1-3      S7_0.2.1               
[16] uuid_1.2-1              lifecycle_1.0.5         compiler_4.4.2         
[19] farver_2.1.2            textshaping_1.0.0       mnormt_2.1.2           
[22] codetools_0.2-20        litedown_0.9            fontquiver_0.2.1       
[25] fontLiberation_0.1.0    htmltools_0.5.9         yaml_2.3.10            
[28] pillar_1.11.1           openssl_2.3.2           nlme_3.1-166           
[31] fontBitstreamVera_0.1.1 commonmark_2.0.0        tidyselect_1.2.1       
[34] zip_2.3.2               digest_0.6.39           stringi_1.8.7          
[37] purrr_1.2.1             labeling_0.4.3          rprojroot_2.1.1        
[40] fastmap_1.2.0           grid_4.4.2              cli_3.6.5              
[43] magrittr_2.0.4          patchwork_1.3.0         withr_3.0.2            
[46] gdtools_0.5.0           scales_1.4.0            rmarkdown_2.30         
[49] officer_0.7.3           askpass_1.2.1           ragg_1.5.1             
[52] evaluate_1.0.5          knitr_1.51              viridisLite_0.4.2      
[55] markdown_2.0            rlang_1.1.7             Rcpp_1.1.1             
[58] glue_1.8.0              BiocManager_1.30.27     xml2_1.3.6             
[61] renv_1.1.7              rstudioapi_0.17.1       jsonlite_2.0.0         
[64] R6_2.6.1                systemfonts_1.3.1      


References

Davison, Mark L. 1983. Multidimensional Scaling. Probability and Mathematical Statistics. New York: John Wiley & Sons.
Field, Andy, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R. London: SAGE Publications.
Jolliffe, I. T. 2002. Principal Component Analysis. 2nd ed. Springer Series in Statistics. New York: Springer. https://doi.org/10.1007/b98835.
Kaiser, Henry F. 1960. “The Application of Electronic Computers to Factor Analysis.” Educational and Psychological Measurement 20 (1): 141–51. https://doi.org/10.1177/001316446002000116.
Kim, Jae-On, and Charles W. Mueller. 1978. Introduction to Factor Analysis: What It Is and How to Do It. Quantitative Applications in the Social Sciences 13. Newbury Park, CA: SAGE Publications. https://doi.org/10.4135/9781412984652.