This tutorial introduces Text Similarity (see Zahrotun 2016; Li and Han 2013), i.e. how close or similar two pieces of text are, either with respect to their use of words or characters (lexical similarity) or in terms of meaning (semantic similarity). It is aimed at beginners and intermediate users of R and showcases how to assess the similarity of texts in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods for assessing text similarity.

The entire R Notebook for the tutorial can be downloaded **here**. If you want to render the R Notebook on your machine, i.e. knit the document to HTML or PDF, you need to make sure that you have R and RStudio installed. You also need to download the **bibliography file** and store it in the same folder as the Rmd file.

*Lexical Similarity* measures the similarity of two texts based on the overlap of their word sets, whether the texts are in the same language or in different languages. A lexical similarity of 1 indicates complete overlap between the vocabularies, while a score of 0 indicates that the two texts have no words in common. There are several ways of evaluating lexical similarity, such as Jaccard similarity, cosine similarity, and Levenshtein distance.

*Semantic Similarity*, on the other hand, measures the similarity between two texts based on their meaning rather than on the words or characters they share. Semantic similarity is highly useful for summarizing texts and extracting key attributes from large documents or document collections. It can be evaluated using methods such as *Latent Semantic Analysis* (LSA), *Normalised Google Distance* (NGD), and *Salient Semantic Analysis* (SSA).

In this tutorial, we focus primarily on Lexical Similarity. We begin with a brief overview of the relevant concepts and then show how different measures can be implemented in R.

The Jaccard similarity is defined as the size of the intersection of two texts divided by the size of their union. In other words, it is the number of unique words that the two texts share divided by the number of unique words that occur in either text. The Jaccard similarity of two documents ranges from 0 to 1, where 0 signifies no similarity and 1 signifies complete overlap. The mathematical representation of the Jaccard similarity is shown below:

\[\begin{equation} J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \end{equation}\]
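The formula can be checked by hand with a few lines of base R. The following is a minimal sketch rather than the implementation used later in this tutorial: the helper `jaccard_words` and the variables `s1` and `s2` are hypothetical names, and lower-casing the words before comparing them is an assumption made here for illustration.

```r
# Hypothetical helper: word-level Jaccard similarity using base R set functions
jaccard_words <- function(x, y) {
  # lower-case, split on whitespace, and keep unique words (set semantics)
  a <- unique(strsplit(tolower(x), "\\s+")[[1]])
  b <- unique(strsplit(tolower(y), "\\s+")[[1]])
  # |A intersect B| / |A union B|
  length(intersect(a, b)) / length(union(a, b))
}

s1 <- "The quick brown fox jumped over the wall"
s2 <- "The fast brown fox leaped over the wall"

# 5 shared words (the, brown, fox, over, wall) out of 9 unique words overall
jaccard_words(s1, s2)  # 5 / 9, roughly 0.556
```

Because the words are treated as sets, the repeated word "the" counts only once in each text.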

In the case of cosine similarity, the two documents are represented as vectors in an n-dimensional space, with each dimension corresponding to a word. The cosine similarity metric thus measures the cosine of the angle between the two vectors. For vectors of word counts, which cannot contain negative values, the cosine similarity ranges from 0 to 1. A value closer to 0 indicates less similarity, whereas a value closer to 1 indicates more similarity. The mathematical representation of the cosine similarity is shown below:

\[\begin{equation} \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}} \end{equation}\]
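To make the formula concrete, the sketch below builds word-count vectors over a shared vocabulary and applies the formula directly in base R. The helper `cosine_words` is a hypothetical name, and lower-casing plus whitespace tokenisation are assumptions made for illustration. It is not the approach used in the package-based code later in this tutorial.

```r
# Hypothetical helper: cosine similarity of two texts based on word counts
cosine_words <- function(x, y) {
  a <- strsplit(tolower(x), "\\s+")[[1]]
  b <- strsplit(tolower(y), "\\s+")[[1]]
  vocab <- union(a, b)
  # term-frequency vectors A and B over the shared vocabulary
  va <- as.numeric(table(factor(a, levels = vocab)))
  vb <- as.numeric(table(factor(b, levels = vocab)))
  # dot product divided by the product of the vector lengths
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

s1 <- "The quick brown fox jumped over the wall"
s2 <- "The fast brown fox leaped over the wall"

# dot product = 8, both vectors have squared length 10
cosine_words(s1, s2)  # 8 / 10 = 0.8
```

Unlike the set-based Jaccard measure, the repeated word "the" contributes a count of 2 to each vector here, which is why word frequency affects the result.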

Levenshtein distance is generally computed between two words. It is the minimum number of single-character edits required to change one word into the other. The higher the number of edits, the more the two words differ from each other. An edit is either an insertion of a character, a deletion of a character, or a replacement of a character. For two words *a* and *b*, the Levenshtein distance between the first *i* characters of *a* and the first *j* characters of *b* is defined as follows:

\[\begin{equation} \operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j)+1 \\ \operatorname{lev}_{a,b}(i,j-1)+1 \\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_{i} \neq b_{j})} \end{cases} & \text{otherwise.} \end{cases} \end{equation}\]
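The recurrence can be turned into a small dynamic-programming routine. The sketch below is for illustration only: `lev_dist` is a hypothetical helper, not a function from any package used in this tutorial. Its result is compared against base R's `adist`, which computes a generalized Levenshtein distance.

```r
# Hypothetical helper: Levenshtein distance via dynamic programming
lev_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds lev(i, j); the first row and column are the
  # base case max(i, j) when min(i, j) = 0
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])           # indicator 1_(a_i != b_j)
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,    # deletion
                             d[i + 1, j] + 1,    # insertion
                             d[i, j] + cost)     # replacement (free if equal)
    }
  }
  d[n + 1, m + 1]
}

lev_dist("kitten", "sitting")            # 3
drop(utils::adist("kitten", "sitting"))  # 3, base R agrees
```

Filling the table row by row avoids recomputing the same sub-problems, which a direct recursive translation of the formula would do many times.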

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain *packages* into an R *library* so that the scripts shown below are executed without errors. Before turning to the rest of the tutorial, please install the packages by running the code below this paragraph. If you have already installed these packages, you can skip this section. To install the necessary packages, simply run the following code. It may take some time (between 1 and 5 minutes) to install all of the packages, so you do not need to worry if it takes a while.

```
# set options
options(stringsAsFactors = FALSE)
# install libraries
install.packages("stringdist")
install.packages("hashr")
install.packages("tidyverse")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")
```

Now that we have installed the packages, we activate them as shown below.

```
# set options
options(stringsAsFactors = FALSE) # no automatic conversion of strings to factors
options("scipen" = 100, "digits" = 12) # suppress scientific notation
# activate packages
library(stringdist)
library(hashr)
library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()
```

Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

To compute the scores and edit distances for the methods discussed above, we use the *stringdist* package, primarily its functions *stringdist* (for character-wise comparisons of strings) and *seq_dist* (for comparisons of integer sequences). We also use the *hashr* package so that the Jaccard and cosine measures are evaluated word-wise instead of letter-wise: each sentence is tokenised and the resulting list of words is hashed, so that the sentences are transformed into lists of integers. For the Jaccard and cosine measures we use the same pair of texts, whereas for the Levenshtein edit distance we use three pairs of words to illustrate the *insert*, *delete* and *replace* operations.

```
text1 = "The quick brown fox jumped over the wall"
text2 = "The fast brown fox leaped over the wall"
insert_ex = c("Marta","Martha")
del_ex = c("Genome","Gnome")
rep_ex = c("Tim","Tom")
```

```
# Use seq_dist together with hash to compute the word-wise Jaccard distance
# (q = 2 compares word bigrams; the similarity is 1 minus the distance)
jac_dist_score = seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "jaccard", q = 2)
print(paste0("The Jaccard distance for the two texts is ", jac_dist_score))
```

`## [1] "The Jaccard distance for the two texts is 0.727272727272727"`

```
# Use seq_dist together with hash to compute the word-wise cosine distance
# (q = 2 compares word bigrams; the similarity is 1 minus the distance)
cos_dist_score = seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "cosine", q = 2)
print(paste0("The cosine distance for the two texts is ", cos_dist_score))
```

`## [1] "The cosine distance for the two texts is 0.571428571428572"`
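The two values reported above (0.727... and 0.571...) can be reproduced by hand in base R. Doing so suggests that they are distances computed on word bigrams, i.e. one minus the respective similarity. The sketch below is only a cross-check under that reading. The helper `word_bigrams` is a hypothetical name, and the cosine shortcut relies on no bigram occurring more than once in either text, which holds for these two sentences.

```r
# Hypothetical helper: adjacent word pairs (bigrams) of a sentence
word_bigrams <- function(x) {
  w <- strsplit(x, "\\s+")[[1]]
  paste(head(w, -1), tail(w, -1))
}

b1 <- word_bigrams("The quick brown fox jumped over the wall")
b2 <- word_bigrams("The fast brown fox leaped over the wall")

shared  <- length(intersect(b1, b2))               # 3 shared bigrams
jac_sim <- shared / length(union(b1, b2))          # 3 / 11
cos_sim <- shared / sqrt(length(b1) * length(b2))  # 3 / 7 (each bigram occurs once)

# distances are 1 minus the similarities: roughly 0.727 and 0.571
c(jaccard_distance = 1 - jac_sim, cosine_distance = 1 - cos_sim)
```

Read as similarities, the two texts therefore score about 0.27 (Jaccard) and 0.43 (cosine) at the level of word bigrams.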

```
# Insert edit
ins_edit = stringdist(insert_ex[1],insert_ex[2],method = "lv")
print(paste0("The insert edit distance for ",insert_ex[1]," and ",insert_ex[2]," is ",ins_edit))
```

`## [1] "The insert edit distance for Marta and Martha is 1"`

```
# Delete edit
del_edit = stringdist(del_ex[1],del_ex[2],method = "lv")
print(paste0("The delete edit distance for ",del_ex[1]," and ",del_ex[2]," is ",del_edit))
```

`## [1] "The delete edit distance for Genome and Gnome is 1"`

```
# Replace edit
rep_edit = stringdist(rep_ex[1],rep_ex[2],method = "lv")
print(paste0("The replace edit distance for ",rep_ex[1]," and ",rep_ex[2]," is ",rep_edit))
```

`## [1] "The replace edit distance for Tim and Tom is 1"`
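An edit distance of 1 means something different for a three-letter word than for a six-letter word. One common way to put the distances on a 0 to 1 scale is to divide by the length of the longer word and subtract the result from 1. The sketch below does this in base R with `adist`; the helper `lev_similarity` and the list `pairs` are hypothetical names, and the normalisation shown is an assumption chosen for illustration rather than the only possible choice.

```r
# Word pairs from the examples above
pairs <- list(insert  = c("Marta", "Martha"),
              delete  = c("Genome", "Gnome"),
              replace = c("Tim", "Tom"))

# Hypothetical helper: 1 - (edit distance / length of the longer word)
lev_similarity <- function(p) {
  d <- drop(utils::adist(p[1], p[2]))
  1 - d / max(nchar(p))
}

# insert and delete: 1 - 1/6, roughly 0.833; replace: 1 - 1/3, roughly 0.667
sapply(pairs, lev_similarity)
```

On this scale, *Tim* and *Tom* come out as less similar than the other two pairs, even though all three pairs have the same edit distance.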

As shown above, the Jaccard and cosine scores differ, which is important to keep in mind when using different measures to determine similarity. Note that *seq_dist* returns distances rather than similarities and that, with *q = 2*, it compares word bigrams rather than single words. The corresponding similarities are one minus the reported values, i.e. about 0.27 for Jaccard and 0.43 for cosine. The two measures differ primarily because Jaccard only takes the unique items in the two texts into consideration, whereas the cosine approach takes the frequency of each item, and hence the length of the vectors, into consideration. For the Levenshtein edit distance, the examples above show that in the first case we have to insert an extra *h*, in the second we have to delete an *e*, and in the last case we need to replace *i* with *o*. Thus, for all the pairs considered here, the edit distance is 1.

Majumdar, Dattatreya. 2022. *Lexical Text Similarity using R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/lexsim.html (Version 2022.09.13).

```
@manual{Majumdar2022ta,
author = {Majumdar, Dattatreya},
title = {Lexical Text Similarity using R},
note = {https://slcladal.github.io/lexsim.html},
year = {2022},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2022.09.13}
}
```

`sessionInfo()`

```
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] flextable_0.7.3 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
## [5] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
## [9] ggplot2_3.3.6 tidyverse_1.3.2 hashr_0.1.4 stringdist_0.9.8
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.3 sass_0.4.1 jsonlite_1.8.0
## [4] modelr_0.1.8 bslib_0.3.1 assertthat_0.2.1
## [7] highr_0.9 renv_0.15.4 googlesheets4_1.0.0
## [10] cellranger_1.1.0 yaml_2.3.5 gdtools_0.2.4
## [13] pillar_1.7.0 backports_1.4.1 glue_1.6.2
## [16] uuid_1.1-0 digest_0.6.29 rvest_1.0.2
## [19] colorspace_2.0-3 htmltools_0.5.2 pkgconfig_2.0.3
## [22] broom_1.0.0 haven_2.5.0 scales_1.2.0
## [25] officer_0.4.3 tzdb_0.3.0 googledrive_2.0.0
## [28] generics_0.1.3 ellipsis_0.3.2 withr_2.5.0
## [31] klippy_0.0.0.9500 cli_3.3.0 magrittr_2.0.3
## [34] crayon_1.5.1 readxl_1.4.0 evaluate_0.15
## [37] fs_1.5.2 fansi_1.0.3 xml2_1.3.3
## [40] tools_4.2.1 data.table_1.14.2 hms_1.1.1
## [43] gargle_1.2.0 lifecycle_1.0.1 munsell_0.5.0
## [46] reprex_2.0.1 zip_2.2.0 compiler_4.2.1
## [49] jquerylib_0.1.4 systemfonts_1.0.4 rlang_1.0.4
## [52] grid_4.2.1 base64enc_0.1-3 rmarkdown_2.14
## [55] gtable_0.3.0 DBI_1.1.3 R6_2.5.1
## [58] lubridate_1.8.0 knitr_1.39 fastmap_1.1.0
## [61] utf8_1.2.2 stringi_1.7.8 parallel_4.2.1
## [64] Rcpp_1.0.8.3 vctrs_0.4.1 dbplyr_2.2.1
## [67] tidyselect_1.1.2 xfun_0.31
```

Li, Baoli, and Liping Han. 2013. “Distance Weighted Cosine Similarity Measure for Text Classification.” In *International Conference on Intelligent Data Engineering and Automated Learning*, 611–18. Springer.

Zahrotun, Lisna. 2016. “Comparison Jaccard Similarity, Cosine Similarity and Combined Both of the Data Clustering with Shared Nearest Neighbor Method.” *Computer Engineering and Applications Journal* 5 (1): 11–18.