To be able to follow this tutorial, we suggest you check out and familiarize yourself with the content of the following **R Basics** and **Data Science Basics** tutorials:

- [Getting started with R](https://ladal.edu.au/intror.html)
- [Basic Concepts in Quantitative Research](https://ladal.edu.au/basicquant.html)
- [Loading, saving, and generating data in R](https://ladal.edu.au/load.html)
- [Handling tables in R](https://ladal.edu.au/table.html)

Click [**here**](https://ladal.edu.au/content/clust.Rmd)^[If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the [**bibliography file**](https://ladal.edu.au/content/bibliography.bib) and store it in the same folder where you store the Rmd file.] to download the **entire R Notebook** for this tutorial.

A more elaborate and highly recommendable introduction to cluster analysis is @kassambara2017practical. Other very useful resources are, e.g., @king2015cluster; @kettenring2006practice; @romesburg2004cluster; and @blashfield1988methods. **Preparation and session set up** This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R [here](https://slcladal.github.io/intror.html). For this tutorials, we need to install certain *packages* from an R *library* so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time). ```{r prep1, echo=T, eval = F, message=FALSE, warning=FALSE} # set options options(stringsAsFactors = F) # no automatic data transformation options("scipen" = 100, "digits" = 4) # suppress math annotation # install packages install.packages("cluster") install.packages("factoextra") install.packages("seriation") install.packages("pvclust") install.packages("ape") install.packages("vcd") install.packages("exact2x2") install.packages("flextable") install.packages("tibble") install.packages("gplots") # install klippy for copy-to-clipboard button in code chunks install.packages("remotes") remotes::install_github("rlesur/klippy") ``` In a next step, we load the packages. ```{r prep2, message=FALSE, warning=FALSE} # load packages library(cluster) library(factoextra) library(seriation) library(pvclust) library(ape) library(vcd) library(exact2x2) library(flextable) library(tibble) library(gplots) # activate klippy for copy-to-clipboard button klippy::klippy() ``` Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go. # Cluster Analysis The most common method in linguistics that is sued to detect groups in data are cluster analyses. Cluster analyses are common in linguistics because they not only detect commonalities based on the frequency or occurrence of features but they also allow to visualize when splits between groups have occurred and are thus the method of choice in historical linguistics to determine and show genealogical relationships. ## Underlying Concepts{-} The next section focuses on the basic idea that underlies all cluster analyses. WE will have a look at some very basic examples to highlight and discuss the principles that cluster analyses rely on. The underlying idea of cluster analysis is very simple and rather intuitive as we ourselves perform cluster analyses every day in our lives. This is so because we group things together under certain labels and into concepts. The first example used to show this, deals with types of trees and how we group these types of trees based on their outward appearance. Imagine you see six trees representing different types of trees: a pine tree, a fir tree, an oak tree, a beech tree, a phoenix palm tree, and a nikau palm tree. Now, you were asked to group these trees according to similarity. Have a look at the plot below and see whether you would have come up with a similar type of grouping. ```{r cl1, eval = T, echo = F, results = 'asis', message=FALSE, warning=FALSE} x <- 1:10 y <- 1:10 plot(x, y, type = "n", ylim = c(-.5,10), xlim = c(0,5), axes = F, xlab = "", ylab = "") text("Trees", x = 2.25, y = 10, cex = 1.5) text("Conifers", x = .5, y = 6.5, cex = 1.5) text("Broad leaf", x = 2.25, y = 6.5, cex = 1.5) text("Palms", x = 4, y = 6.5, cex = 1.5) text("Pine tree", x = .25, y = 1.5, srt=90, cex = 1.5) text("Fir tree", x = .75, y = 1.5, srt=90, cex = 1.5) text("Oak tree", x = 2, y = 1.5, srt=90, cex = 1.5) text("Beech tree", x = 2.5, y = 1.5, srt=90, cex = 1.5) text("Phoenix palm", x = 3.75, y = 1.75, srt=90, cex = 1.5) text("Nikau palm", x = 4.25, y = 1.5, srt=90, cex = 1.5) # lines(x = c(.5, 1.75), y = c(7, 9), lwd = 2) lines(x = c(2.25, 2.25), y = c(7, 9), lwd = 2) lines(x = c(4, 2.75), y = c(7, 9), lwd = 2) # lines(x = c(.5, .5), y = c(6, 4.5), lwd = 2) lines(x = c(2.25, 2.25), y = c(6, 4.5), lwd = 2) lines(x = c(4, 4), y = c(6, 4.75), lwd = 2) # lines(x = c(.25, .75), y = c(4.5, 4.5), lwd = 2) lines(x = c(2, 2.5), y = c(4.5, 4.5), lwd = 2) lines(x = c(3.75, 4.25), y = c(4.75, 4.75), lwd = 2) # lines(x = c(.25, .25), y = c(4.5, 4), lwd = 2) lines(x = c(.75, .75), y = c(4.5, 4), lwd = 2) lines(x = c(2, 2), y = c(4.5, 4), lwd = 2) lines(x = c(2.5, 2.5), y = c(4.5, 4), lwd = 2) ``` An alternative way to group the trees would be the following. ```{r cl2, eval = T, echo = F, results = 'asis', message=FALSE, warning=FALSE} x <- 1:10 y <- 1:10 plot(x, y, type = "n", ylim = c(-.5,15), xlim = c(0,5), axes = F, xlab = "", ylab = "") text("Trees", x = 2.25, y = 15, cex = 1) text("Conifers", x = .5, y = 6.5, cex = 1) text("Broad leaf", x = 2.25, y = 6.5, cex = 1) text("Palm Trees", x = 3.5, y = 10, cex = 1) text("Pine tree", x = .25, y = 1.5, srt=90, cex = 1) text("Fir tree", x = .75, y = 1.5, srt=90, cex = 1) text("Oak tree", x = 2, y = 1.5, srt=90, cex = 1) text("Beech tree", x = 2.5, y = 1.5, srt=90, cex = 1) text("Phoenix palm", x = 3.25, y = 1.75, srt=90, cex = 1) text("Nikau palm", x = 3.75, y = 1.5, srt=90, cex = 1) # lines(x = c(1.5, 2.15), y = c(11, 13.5), lwd = 2) lines(x = c(3.5, 2.5), y = c(11, 13.5), lwd = 2) lines(x = c(.5, 1.5), y = c(7.25, 11), lwd = 2) lines(x = c(1.5, 2.25), y = c(11, 7.25), lwd = 2) # lines(x = c(.5, .5), y = c(6, 4.5), lwd = 2) lines(x = c(2.25, 2.25), y = c(6, 4.5), lwd = 2) lines(x = c(3.5, 3.5), y = c(8.75, 6.25), lwd = 2) # lines(x = c(.25, .75), y = c(4.5, 4.5), lwd = 2) lines(x = c(2, 2.5), y = c(4.5, 4.5), lwd = 2) lines(x = c(3.25, 3.75), y = c(6.25, 6.25), lwd = 2) # lines(x = c(.25, .25), y = c(4.5, 4), lwd = 2) lines(x = c(.75, .75), y = c(4.5, 4), lwd = 2) lines(x = c(2, 2), y = c(4.5, 4), lwd = 2) lines(x = c(2.5, 2.5), y = c(4.5, 4), lwd = 2) ``` In this display, conifers and broad-leaf trees are grouped together because there are more similar to each other compared to palm trees. This poses the question of what is meant by similarity. Consider the display below. ```{r cl3, eval = T, echo = F, results = 'asis', message=FALSE, warning=FALSE} # generate data y <- c(1, 3.1, 1.2, 2.3, 3.4, 2.5, 1.6, 2.7, 3.8, 2.9) x <- c(1:10) plot(x, y, type = "l", ylim = c(0,11), xaxt='n', yaxt='n', ann=FALSE, lwd = 2, ylab = "", xlab = "") lines(x = 1:10, y = c(5, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9), col = "blue", lwd = 2) lines(x = 1:10, y = c(8, 10.1, 8.2, 9.3, 10.4, 9.5, 8.6, 9.7, 10.8, 9.9), col = "red", lwd = 2) ``` Are the red and the blue line more similar because they have the same shape or are the red and the black line more similar because they are closer together? There is no single correct answer here. Rather the plot intends to raise awareness about the fact that how cluster analyses group data depends on how similarity is defined in the respective algorithm. Let's consider another example to better understand how cluster analyses determine which data points should be merged when. Imagine you have five students and want to group them together based on their overall performance in school. The data that you rely on are their grades in math, music, and biology (with 1 being the best grade and 6 being the worst). ```{r cl4, eval = T, echo = F, results = 'asis', message=FALSE, warning=FALSE} # similarity students <- matrix(c(2, 3, 2, 1, 3, 2, 1, 2, 1, 2, 4, 4, 3, 4, 3), nrow = 5, byrow = T) students <- as.data.frame(students) colnames(students) <- c("Math", "Music", "Biology") rownames(students) <- c("StudentA", "StudentB", "StudentC", "StudentD", "StudentE") ``` ```{r cl5, eval = T, echo = F, results = 'asis', message=FALSE, warning=FALSE} students %>% as.data.frame() %>% tibble::rownames_to_column("Student") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Sample of five students and their grades in math, music, and biology.") %>% flextable::border_outer() ``` The first step in determining the similarity among students is to create a distance matrix. ```{r cl6, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} diststudents <- dist(students, method = "manhattan") # create a distance matrix ``` The distance matrix below shows that Student A and Student B only differ by one grade. Student B and Student C differ by 2 grades. Student A and Student C differ by 3 grades and so on. ```{r cl7, eval = T, echo = F, results = 'asis', message=FALSE, warning=FALSE} diststudentstb <- matrix(c("1", "3", "3","3", "", "2", "4", "4","", "", "6", "6", "", "", "", "2"), nrow = 4, byrow = F) # add column and row names colnames(diststudentstb) <- c("StudentA", "StudentB", "StudentC", "StudentD") rownames(diststudentstb) <- c("StudentB", "StudentC", "StudentD", "StudentE") diststudentstb %>% as.data.frame() %>% tibble::rownames_to_column("Student") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Distance matrix based of students based on grades in math, music, and biology.") %>% flextable::border_outer() ``` Based on this distance matrix, we can now implement a cluster analysis in R. ## Cluster Analysis on Numeric Data{-} To create a simple cluster object in R, we use the `hclust` function from the `cluster` package. The resulting object is then plotted to create a dendrogram which shows how students have been amalgamated (combined) by the clustering algorithm (which, in the present case, is called `ward.D`). ```{r cl8, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} # create hierarchical cluster object with ward.D as linkage method clusterstudents <- hclust(diststudents, method="ward.D") # plot result as dendrogram plot(clusterstudents, hang = 0) ``` Let us have a look at how the clustering algorithm has amalgamated the students. The amalgamation process takes the distance matrix from above as a starting point and, in a first step, has merged Student A and Student B (because they were the most similar students in the data based on the distance matrix). After collapsing Student A and Student B, the resulting distance matrix looks like the distance matrix below (notice that Student A and Student B now form a cluster that is represented by the means of the grades of the two students). ```{r cl9, eval=F, echo=T, message=FALSE, warning=FALSE} students2 <- matrix(c(1.5, 3, 2, 1, 2, 1, 2, 4, 4, 3, 4, 3), nrow = 4, byrow = T) students2 <- as.data.frame(students2) rownames(students2) <- c("Cluster1", "StudentC", "StudentD", "StudentE") diststudents2 <- dist(students2, method = "manhattan") ``` ```{r cl10, eval = T, echo = F, results = 'asis'} diststudentstb <- matrix(c("2.5","3.5","3.5","","6.0","6.0","","","2.0"), nrow = 3, byrow = F) # add column and row names colnames(diststudentstb) <- c("Cluster 1", "Student C", "Student D") rownames(diststudentstb) <- c("Student C", "Student D", "Student E") diststudentstb %>% as.data.frame() %>% tibble::rownames_to_column("Student") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Distance matrix of students based on grades in math, music, and biology.") %>% flextable::border_outer() ``` The next lowest distance now is 2.0 between Student D and Student E which means that these two students are merged next. The resulting distance matrix is shown below. ```{r cl11, eval=F, echo=T, message=FALSE, warning=FALSE} students3 <- matrix(c(1.5,3,2,1,2,1,2.5,4,3.5), nrow = 3, byrow = T) students3 <- as.data.frame(students3) rownames(students3) <- c("Cluster1", "StudentC", "Cluster2") diststudents3 <- dist(students3, method = "manhattan") ``` ```{r cl12, eval = T, echo=F, message=FALSE, warning=FALSE} diststudentstb <- matrix(c("2.5", "3.5", "", "6.0"), nrow = 2, byrow = F) # add column and row names colnames(diststudentstb) <- c("Cluster 1", "Student C") rownames(diststudentstb) <- c("Student C", "Cluster 2") diststudentstb %>% as.data.frame() %>% tibble::rownames_to_column("Student") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Distance matrix based of students based on grades in math, music, and biology.") %>% flextable::border_outer() ``` Now, the lowest distance value occurs between Cluster 1 and Student C. Thus, Student C and Cluster 1 are merged. In the final step, the Cluster 2 is merged with the new cluster encompassing Student C and Cluster 1. This amalgamation process can then be displayed visually as a dendrogram (see above). How and which elements are merged depends on the what is understood as distance. Since "distance" is such an important concept in cluster analyses, we will briefly discuss this notion to understand why there are so many different types of clustering algorithms and this cluster analyses. ## Distance and Similarity Measures{-} To understand how a cluster analysis determines to which cluster a given data point belongs, we need to understand what different distance measures represent. Have a look at the Figure below which visually represents three different ways to conceptualize distance. ```{r cl13, echo = F, message=F, warning=F} par(mar=c(1,1,1,1)) # define margin width of the plot x <- c(1,5) # define an x value y <- c(1,5) # define a y value plot(x, y, pch = 20, cex = 1, axes = F, las = 1, xlab = "", ylab = "", xlim = c(0,7), ylim = c(0,10)) text(0.5, .5, "Point A", cex = 1) text(5, 5.5, "Point B", cex = 1) lines(x = c(1, 5), y = c(1, 5), type = "l", lty = 3, lwd = 2, col = "red") lines(x = c(1, 5), y = c(1, 1), type = "l", lty = 2, lwd = 2, col = "blue") lines(x = c(5, 5), y = c(1, 5), type = "l", lty = 4, lwd = 2, col = "green") lines(x = c(.9, 5), y = c(.9, .9), type = "l", lty = 4, lwd = 2, col = "green") legend("topleft", inset=.05, title="", bty = "n", lty = c(3, 2, 4), lwd = 2, c("euclidean distance", "maximum distance", "manhatten distance"), col=c("red", "blue", "green"), horiz=F, cex = 1); par(mar=c(5.1,4.1,4.1,2.1)) ``` The Figure above depicts three ways to measure distance: the `euclidean` distance represents the distance between points as the hypotenuse of the x- and y-axis distances while the "maximum distance" represents distance as the longer distance of either the distance on the x- or the y-axis. The `manhatten` distance (or block distance) is the sum of the distances on the x- and the y-axis. We will now turn to another example in order to delve a little deeper into how clustering algorithms work. In this example, we will find cluster of varieties of English based on the relative frequency of selected non-standard features (such as the relative frequencies of cleft constructions and tag questions). As a first step, we generate some fictional data set for this analysis. ```{r cl14, message=F, warning=F} # generate data IrishEnglish <- round(sqrt((rnorm(10, 9.5, .5))^2), 3) ScottishEnglish <- round(sqrt((rnorm(10, 9.3, .4))^2), 3) BritishEnglish <- round(sqrt((rnorm(10, 6.4, .7))^2), 3) AustralianEnglish <- round(sqrt((rnorm(10, 6.6, .5))^2), 3) NewZealandEnglish <- round(sqrt((rnorm(10, 6.5, .4))^2), 3) AmericanEnglish <- round(sqrt((rnorm(10, 4.6, .8))^2), 3) CanadianEnglish <- round(sqrt((rnorm(10, 4.5, .7))^2), 3) JamaicanEnglish <- round(sqrt((rnorm(10, 1.4, .2))^2), 3) PhillipineEnglish <- round(sqrt((rnorm(10, 1.5, .4))^2), 3) IndianEnglish <- round(sqrt((rnorm(10, 1.3, .5))^2), 3) clus <- data.frame(IrishEnglish, ScottishEnglish, BritishEnglish, AustralianEnglish, NewZealandEnglish, AmericanEnglish, CanadianEnglish, JamaicanEnglish, PhillipineEnglish, IndianEnglish) # add row names rownames(clus) <- c("nae_neg", "like", "clefts", "tags", "youse", "soitwas", "dt", "nsr", "invartag", "wh_cleft") ``` ```{r cl14b, echo = F, message=F, warning=F} clus %>% as.data.frame() %>% tibble::rownames_to_column("Feature") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Frequencies of on-standard features across selected varieties of English.") %>% flextable::border_outer() ``` As a next step, we create a cluster object based on the data we have just generated. ```{r cl15, message=F, warning=F} # clean data clusm <- as.matrix(clus) clust <- t(clusm) # transpose data clust <- na.omit(clust) # remove missing values clusts <- scale(clust) # standardize variables clusts <- as.matrix(clusts) # convert into matrix ``` ```{r cl15b, echo = F, message=F, warning=F} clust %>% as.data.frame() %>% tibble::rownames_to_column("Variety") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Scaled frequencies of on-standard features across selected varieties of English.") %>% flextable::border_outer() ``` We assess if data is "clusterable" by testing if the data contains non-randomness. To this end, we calculate the *Hopkins statistic* which indicates how similar the data is to a random distribution. * A Hopkins value of 0.5 indicates that the data is random and that there are no inherent clusters. * If the Hopkins statistic is close to 1, then the data is highly clusterable. * Values of 0 indicate that the data is uniform [@aggarwal2015data, pp. 158]. The `n` in the `get_clust_tendency` functions represents the maximum number of clusters to be tested which should be number of predictors in the data. ```{r cl16, message=FALSE, warning=FALSE} # apply get_clust_tendency to cluster object clusttendency <- get_clust_tendency(clusts, # define number of points from sample space n = 9, gradient = list( # define color for low values low = "steelblue", # define color for high values high = "white")) clusttendency[1] ``` As the Hopkins value is substantively higher than .5 (randomness) and closer to 1 (highly clusterable) than to .5, thus indicating that there is sufficient structure in the data to warrant a cluster analysis. As such, we can assume that there are actual clusters in the data and continue by generating a distance matrix using euclidean distances. ```{r cl17, eval = T, echo = T, message=FALSE, warning=FALSE} clustd <- dist(clusts, # create distance matrix method = "euclidean") # use euclidean (!) distance ``` ```{r cl17b, echo = F, message=F, warning=F} round(clustd, 2) %>% as.matrix() %>% as.data.frame() %>% tibble::rownames_to_column("Variety") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Distance matrix of scaled frequencies of non-standard features across selected varieties of English.") %>% flextable::border_outer() ``` Below are other methods to create distance matrices with some comments on when using which metric is appropriate. ```{r cl18, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} # create distance matrix (euclidean method: not good when dealing with many dimensions) clustd <- dist(clusts, method = "euclidean") # create distance matrix (maximum method: here the difference between points dominates) clustd_maximum <- round(dist(clusts, method = "maximum"), 2) # create distance matrix (manhattan method: most popular choice) clustd_manhatten <- round(dist(clusts, method = "manhattan"), 2) # create distance matrix (canberra method: for count data only - focuses on small differences and neglects larger differences) clustd_canberra <- round(dist(clusts, method = "canberra"), 2) # create distance matrix (binary method: for binary data only!) clustd_binary <- round(dist(clusts, method = "binary"), 2) # create distance matrix (minkowski method: is not a true distance measure) clustd_minkowski <- round(dist(clusts, method = "minkowski"), 2) # distance method for words: daisy (other possible distances are "manhattan" and "gower") clustd_daisy <- round(daisy(clusts, metric = "euclidean"), 2) ``` If you call the individual distance matrices, you will see that depending on which distance measure is used, the distance matrices differ dramatically! Have a look at the distance matrix created using the `manhatten` metric and compare it to the distance matrix created using the `euclidian` metric (see above). ```{r cl19, eval = F, echo = T, message=FALSE, warning=FALSE} clustd_maximum ``` ```{r cl19b, eval = T, echo = F, message=FALSE, warning=FALSE} clustd_maximum %>% as.matrix() %>% as.data.frame() %>% tibble::rownames_to_column("Variety") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Distance matrix of selected varieties of English.") %>% flextable::border_outer() ``` Next, we create a distance plot using the `distplot` function. If the distance plot shows different regions (non-random, non-uniform gray areas) then clustering the data is permittable as the data contains actual structures. ```{r cl20, eval = T, echo=T, message=FALSE, warning=FALSE} # create distance plot dissplot(clustd) ``` The most common method for clustering is called `ward.D` or `ward.D2`. Both of these linkage functions seek to minimize variance. This means that they cluster in a way that the amount of variance is at a minimum (comparable to the regression line in an *ordinary least squares* (OLS) design). ```{r cl21, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} # create cluster object cd <- hclust(clustd, method="ward.D2") # display dendrogram plot(cd, hang = -1) ``` We will briefly go over some other, alternative linkage methods. Which linkage method is and should be used depends on various factors, for example, the type of variables (nominal versus numeric) or whether the focus should be placed on commonalities or differences. ```{r cl22, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} # single linkage: cluster with nearest data point cd_single <- hclust(clustd, method="single") # create cluster object (ward.D linkage) cd_wardd <- hclust(clustd, method="ward.D") # create cluster object (ward.D2 linkage): # cluster in a way to achieve minimum variance cd_wardd2 <- hclust(clustd, method="ward.D2") # average linkage: cluster with closest mean cd_average <- hclust(clustd, method="average") # mcquitty linkage cd_mcquitty <- hclust(clustd, method="mcquitty") # median linkage: cluster with closest median cd_median <- hclust(clustd, method="median") # centroid linkage: cluster with closest prototypical point of target cluster cd_centroid <- hclust(clustd, method="centroid") # complete linkage: cluster with nearest/furthest data point of target cluster cd_complete <- hclust(clustd, method="complete") ``` Now, we determine the optimal number of clusters based on silhouette widths which shows the ratio of internal similarity of clusters against the similarity between clusters. If the silhouette widths have values lower than .2 then this indicates that clustering is not appropriate [@levshina2015linguistics 311]. The function below displays the silhouette width values of 2 to 8 clusters. ```{r cl23, eval = T, echo = T, message=FALSE, warning=FALSE} optclus <- sapply(2:8, function(x) summary(silhouette(cutree(cd, k = x), clustd))$avg.width) optclus # inspect results optnclust <- which(optclus == max(optclus)) # determine optimal number of clusters groups <- cutree(cd, k=optnclust) # cut tree into optimal number of clusters ``` The optimal number of clusters is the cluster solution with the highest silhouette width. We cut the tree into the optimal number of clusters and plot the result. ```{r cl24, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} groups <- cutree(cd, k=optnclust) # cut tree into optimal clusters plot(cd, hang = -1, cex = .75) # plot result as dendrogram rect.hclust(cd, k=optnclust, border="red") # draw red borders around clusters ``` In a next step, we aim to determine which factors are particularly important for the clustering - this step is comparable to measuring the effect size in inferential designs. ```{r cl25, eval = T, echo = T, message=FALSE, warning=FALSE} # which factors are particularly important celtic <- clusts[c(1,2),] others <- clusts[-c(1,2),] # calculate column means celtic.cm <- colMeans(celtic) others.cm <- colMeans(others) # calculate difference between celtic and other englishes diff <- celtic.cm - others.cm sort(diff, decreasing = F) ``` ```{r cl26, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} plot(sort(diff), # y-values 1:length(diff), # x-values type= "n", # plot type (empty) cex.axis = .75, # axis font size cex.lab = .75, # label font size xlab ="Prototypical for Non-Celtic Varieties (Cluster 2) <-----> Prototypical for Celtic Varieties (Cluster 1)", # x-axis label yaxt = "n", # no y-axis tick marks ylab = "") # no y-axis label text(sort(diff), 1:length(diff), names(sort(diff)), cex = .75) # plot text into plot ``` ```{r cl27, eval = T, echo = T, message=FALSE, warning=FALSE} Outer <- clusts[c(6:8),] # data of outer circle varieties Inner <- clusts[-c(6:8),] # data of inner circle varieties Outer.cm <- colMeans(Outer) # column means for outer circle Inner.cm <- colMeans(Inner) # column means for inner circle diff <- Outer.cm - Inner.cm # difference between inner and outer circle sort(diff, decreasing = F) # order difference between inner and outer circle ``` ```{r cl28, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} plot( # start plot sort(diff), # y-values 1:length(diff), # x-values type= "n", # plot type (empty) cex.axis = .75, # axis font size cex.lab = .75, # label font size xlab ="Prototypical for Inner Circle Varieties (Cluster 2) <-----> Prototypical for Outer Circle Varieties (Cluster 1)", # x-axis label yaxt = "n", # no y-axis tick marks ylab = "") # no y-axis label text(sort(diff), 1:length(diff), names(sort(diff)), cex = .75) # plot text into plot ``` We see that discourse *like* is typical for other varieties and that the use of *youse* as 2^nd^ person plural pronoun and invariant tags are typical for Celtic Englishes. We will now test whether the cluster is justified by validating the cluster solution using bootstrapping. ```{r cl29, eval = T, echo=T, message=FALSE, warning=FALSE, paged.print=FALSE} res.pv <- pvclust(clus, # apply pvclust method to clus data method.dist="euclidean", # use eucledian distance method.hclust="ward.D2", # use ward.d2 linkage nboot = 100) # use 100 bootstrap runs ``` The clustering provides approximately unbiased p-values and bootstrap probability value [see @levshina2015linguistics 316]. ```{r cl30, eval = T, echo=T, message=FALSE, warning=FALSE, paged.print=FALSE} plot(res.pv, cex = .75) pvrect(res.pv) ``` We can also use other packages to customize the dendrograms. ```{r cl31, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} plot(as.phylo(cd), # plot cluster object cex = 0.75, # .75 font size label.offset = .5) # .5 label offset ``` One useful customization is to display an unrooted rather than a rooted tree diagram. ```{r cl32, eval = T, echo = T, results = 'asis', message=FALSE, warning=FALSE} # plot as unrooted tree plot(as.phylo(cd), # plot cluster object type = "unrooted", # plot as unrooted tree cex = .75, # .75 font size label.offset = 1) # .5 label offset ``` ## Cluster Analysis on Nominal Data{-} So far, all analyses were based on numeric data. However, especially when working with language data, the data is nominal or categorical rather than numeric. The following will thus show to implement a clustering method for nominal data. In a first step, we will create a simple data set representing the presence and absence of features across varieties of English. ```{r cl33, eval = T, echo = T, message=FALSE, warning=FALSE} # generate data IrishEnglish <- c(1,1,1,1,1,1,1,1,1,1) ScottishEnglish <- c(1,1,1,1,1,1,1,1,1,1) BritishEnglish <- c(0,1,1,1,0,0,1,0,1,1) AustralianEnglish <- c(0,1,1,1,0,0,1,0,1,1) NewZealandEnglish <- c(0,1,1,1,0,0,1,0,1,1) AmericanEnglish <- c(0,1,1,1,0,0,0,0,1,0) CanadianEnglish <- c(0,1,1,1,0,0,0,0,1,0) JamaicanEnglish <- c(0,0,1,0,0,0,0,0,1,0) PhillipineEnglish <- c(0,0,1,0,0,0,0,0,1,0) IndianEnglish <- c(0,0,1,0,0,0,0,0,1,0) clus <- data.frame(IrishEnglish, ScottishEnglish, BritishEnglish, AustralianEnglish, NewZealandEnglish, AmericanEnglish, CanadianEnglish, JamaicanEnglish, PhillipineEnglish, IndianEnglish) # add row names rownames(clus) <- c("nae_neg", "like", "clefts", "tags", "youse", "soitwas", "dt", "nsr", "invartag", "wh_cleft") # convert into factors clus <- apply(clus, 1, function(x){ x <- as.factor(x) }) ``` ```{r cl33b, eval = T, echo = F, message=FALSE, warning=FALSE} clus %>% as.data.frame() %>% tibble::rownames_to_column("Variety") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Occurrence of non-standard features across selected varieties of English.") %>% flextable::border_outer() ``` Now that we have our data, we will create a distance matrix but in contrast to previous methods, we will use a different distance measure that takes into account that we are dealing with nominal (or binary) data. ```{r cl34, eval = T, echo = T, message=FALSE, warning=FALSE} # clean data clusts <- as.matrix(clus) # create distance matrix clustd <- dist(clusts, method = "binary") # create a distance object with binary (!) distance ``` ```{r cl34b, eval = T, echo = F, message=FALSE, warning=FALSE} round(clustd, 2) %>% as.matrix() %>% as.data.frame() %>% tibble::rownames_to_column("Variety") %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Distance matrix of selected varieties of English.") %>% flextable::border_outer() ``` As before, we can now use hierarchical clustering to display the results as a dendrogram ```{r cl35, eval = T, echo=T, message=FALSE, warning=FALSE} # create cluster object (ward.D2 linkage) : cluster in a way to achieve minimum variance cd <- hclust(clustd, method="ward.D2") # plot result as dendrogram plot(cd, hang = -1) # display dendogram ``` In a next step, we want to determine which features are particularly distinctive for one cluster (the "Celtic" cluster containing Irish and Scottish English). ```{r cl36, eval = T, echo=T, message=FALSE, warning=FALSE} # create factor with celtic varieties on one hand and other varieties on other cluster <- as.factor(ifelse(as.character(rownames(clusts)) == "IrishEnglish", "1", ifelse(as.character(rownames(clusts)) == "ScottishEnglish", "1", "0"))) # convert into data frame clsts.df <- as.data.frame(clusts) # determine significance library(exact2x2) pfish <- fisher.exact(table(cluster, clsts.df$youse)) pfish[[1]] # determine effect size assocstats(table(cluster, clsts.df$youse)) assocstats(table(cluster, clsts.df$like)) ``` Clustering is a highly complex topic and there many more complexities to it. However, this should have helped to get you started. # Correspondence Analysis Correspondence analysis (CA) represents a multivariate statistical technique that provides a graphic method of exploring the relationship between variables in a contingency table. CA is conceptually similar to principal component analysis (PCA), but applies to categorical rather than continuous data. CA consists out of the following four steps: 1. Computing row and column averages 2. Computing expected values 3. Computing the residuals 4. Plotting residuals In this tutorial, we investigate similarities among amplifiers based on their co-occurrences (word embeddings) with adjectives. Adjective amplifiers are elements such as those in 1. to 5. 1. The *very*~amplifier~ *nice*~adjective~ man. 2. A *truely*~amplifier~ *remarkable*~adjective~ woman. 2. He was *really*~amplifier~ *hesitant*~adjective~. 4. The child was *awfully*~amplifier~ *loud*~adjective~. 5. The festival was *so*~amplifier~ *amazing*~adjective~! The similarity among adjective amplifiers can then be used to find clusters or groups of amplifiers that *behave* similarly and are interchangeable. To elaborate, adjective amplifiers are interchangeable with some variants but not with others (consider 6. to 8.; the question mark signifies that the example is unlikely to be used or grammatically not acceptable by L1 speakers of English). 6. The *very*~amplifier~ *nice*~adjective~ man. 7. The *really*~amplifier~ *nice*~adjective~ man. 8. ^?^The *completely*~amplifier~ *nice*~adjective~ man. We start by loading the data, and then displaying the data which is called `vsmdata` and consist of 5,000 observations of adjectives and contains two columns: one column with the adjectives (Adjectives) and another column which has the amplifiers (0 means that the adjective occurred without an amplifier). ```{r vsm1, message=F, warning=F} # load data vsmdata <- base::readRDS(url("https://slcladal.github.io/data/vsd.rda", "rb")) ``` ```{r vsm1b, eval = T, echo = F, message=FALSE, warning=FALSE} vsmdata %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Table showing amplification of English adjectives.") %>% flextable::border_outer() ``` For this tutorial, we will reduce the number of amplifiers and adjectives and thus simplify the data to render it easier to understand what is going on. To simplify the data, we remove + all non-amplified adjectives + the adjectives many and much + adjectives that are amplified less than 10 times In addition, we collapse all amplifiers that occur less than 20 times into a bin category (*other*). ```{r vsm2} # simplify data vsmdata_simp <- vsmdata %>% # remove non-amplifier adjectives dplyr::filter(Amplifier != 0, Adjective != "many", Adjective != "much") %>% # collapse infrequent amplifiers dplyr::group_by(Amplifier) %>% dplyr::mutate(AmpFreq = dplyr::n()) %>% dplyr::ungroup() %>% dplyr::mutate(Amplifier = ifelse(AmpFreq > 20, Amplifier, "other")) %>% # collapse infrequent adjectives dplyr::group_by(Adjective) %>% dplyr::mutate(AdjFreq = dplyr::n()) %>% dplyr::ungroup() %>% dplyr::mutate(Adjective = ifelse(AdjFreq > 10, Adjective, "other")) %>% dplyr::filter(Adjective != "other") %>% dplyr::select(-AmpFreq, -AdjFreq) ``` ```{r vsm2b, eval = T, echo = F, message=FALSE, warning=FALSE} vsmdata_simp %>% as.data.frame() %>% head(10) %>% flextable() %>% flextable::set_table_properties(width = .5, layout = "autofit") %>% flextable::theme_zebra() %>% flextable::fontsize(size = 12) %>% flextable::fontsize(size = 12, part = "header") %>% flextable::align_text_col(align = "center") %>% flextable::set_caption(caption = "Table showing amplification of English adjectives (simplified).") %>% flextable::border_outer() ``` We now use a balloon plot to see if there are any potential correlations between amplifiers and adjectives. ```{r ca3} # 1. convert the data as a table dt <- as.matrix(table(vsmdata_simp)) # 2. Graph balloonplot(t(dt), main ="vsmdata_simp", xlab ="", ylab="", label = FALSE, show.margins = FALSE) ``` The balloon plot suggests that there are potential correlations as the dots (balloons) are not distributed evenly according to frequency. To validate if there is significant correlation between the amplifier types and the adjectives using a $\chi$^2^- test. ```{r ca4, warning=F, message=F} chisq <- chisq.test(dt) chisq ``` The $\chi$^2^- test confirms that there is a significant correlations between amplifier types and the adjectives. ```{r ca5} res.ca <- FactoMineR::CA(dt, graph = FALSE) # inspect results of the CA #print(res.ca) eig.val <- get_eigenvalue(res.ca) eig.val ``` The display of the *eigenvalues* provides information on the amount of variance that is explained by each dimension. The first dimension explains 49.87 percent of the variance, the second dimension explains another 30.34 percent of the variance, leaving all other variables with relative moderate explanatory power as they only account for 20 percent variance. We now plot and interpret the results of the CA. ```{r ca6} # repel= TRUE to avoid text overlapping (slow if many point) fviz_ca_biplot(res.ca, repel = TRUE, col.row = "orange", col.col = "darkgray") ``` The results of the CA show that the adjective *different* is collocating with *other* amplifiers while *very* is collocating with *difficult* and *important*, *pretty* is collocating with *big*, *really* is collocating with *nice*, and *so* is collocating with *bad*. ```{r pca1, eval = F, echo=F, message=FALSE, warning=FALSE} # Principal Component Analysis # inspect data data(iris) head(iris, 3) ``` ```{r pca2, eval = F, echo=F, message=FALSE, warning=FALSE} # log transform log.ir <- log(iris[, 1:4]) ir.species <- iris[, 5] # apply PCA - scale. = TRUE is highly # advisable, but default is FALSE. ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE) ``` ```{r pca3, eval = F, echo=F, message=FALSE, warning=FALSE} # print method print(ir.pca) ``` ```{r pca4, eval = F, echo=F, message=FALSE, warning=FALSE} # plot method plot(ir.pca, type = "l") ``` ```{r pca5, eval = F, echo=F, message=FALSE, warning=FALSE} # summary method summary(ir.pca) ``` ```{r pca6, eval = F, echo=F, message=FALSE, warning=FALSE} # predict PCs predict(ir.pca, newdata=tail(log.ir, 2)) ``` ```{r pca7, eval = F, echo=F, message=FALSE, warning=FALSE} # load library library(devtools) # install library from github install_github("vqv/ggbiplot") # load installed library library(ggbiplot) # create plot g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, groups = ir.species, ellipse = TRUE, circle = TRUE) g <- g + scale_color_discrete(name = '') g <- g + theme(legend.direction = 'horizontal', legend.position = 'top') print(g) ``` ```{r pca8, eval = F, echo=F, message=FALSE, warning=FALSE} require(caret) trans = preProcess(iris[,1:4], method=c("BoxCox", "center", "scale", "pca")) PC = predict(trans, iris[,1:4]) ``` ```{r pca9, eval = F, echo=F, message=FALSE, warning=FALSE} # inspect retained PCs head(PC, 3) # inspect loadings trans$rotation ``` ```{r mds1, eval = F, echo=F, message=FALSE, warning=FALSE} # Multidimensional Scaling # Classical MDS # N rows (objects) x p columns (variables) # each row identified by a unique row name d <- dist(clus) # Euclidean distances between the rows fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim fit # view results # plot solution x <- fit$points[,1] y <- fit$points[,2] plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", type="n") text(x, y, labels = row.names(clus), cex=.7) ``` ```{r mds2, eval = F, echo=F, message=FALSE, warning=FALSE} # Nonmetric MDS # N rows (objects) x p columns (variables) # each row identified by a unique row name library(MASS) d <- dist(clus) # Euclidean distances between the rows fit <- isoMDS(d, k=2) # k is the number of dim fit # view results # plot solution x <- fit$points[,1] y <- fit$points[,2] plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Nonmetric MDS", type="n") text(x, y, labels = row.names(clus), cex=.7) ``` # Citation & Session Info {-} Schweinberger, Martin. 2023. *Cluster and Correspondence Analysis in R*. Brisbane: The University of Queensland. url: https://ladal.edu.au/clust.html (Version 2023.05.31). ``` @manual{schweinberger2023clust, author = {Schweinberger, Martin}, title = {Cluster and Correspondence Analysis in R}, note = {https://ladal.edu.au/clust.html}, year = {2023}, organization = "The University of Queensland, Australia. School of Languages and Cultures}, address = {Brisbane}, edition = {2023.05.31} } ``` ```{r fin} sessionInfo() ``` *** [Back to top](#introduction) [Back to HOME](https://ladal.edu.au) *** # References {-}