Introduction

This tutorial introduces lexicography with R and shows how to use R to create dictionaries and find synonyms through determining semantic similarity in R. While the initial example focuses on English, subsequent sections show how easily this approach can be generalized to languages other than English (e.g. German, French, Spanish, Italian, or Dutch). The entire R-markdown document for the sections below can be downloaded here.

Traditionally, dictionaries are listings of words that are commonly arranged alphabetically and that may include information on definitions, usage, etymologies, pronunciations, translations, etc. (see Agnes, Goldman, and Soltis 2002; Steiner 1985). If such dictionaries, which are typically published as books, contain translations of words in other languages, they are referred to as lexicons. Lexicographical references thus show the inter-relationships among lexical data, i.e. words.

Similarly, in computational linguistics, dictionaries represent a specific format of data where elements are linked to or paired with other elements in a systematic way. Computational lexicology refers to a branch of computational linguistics which is concerned with the use of computers in the study of lexicons. Hence, computational lexicology has been defined as the use of computers in the study of machine-readable dictionaries (see e.g. Amsler 1981). Computational lexicology is distinguished from computational lexicography, which can be defined as the use of computers in the construction of dictionaries and which is the focus of this tutorial. It should be noted, though, that computational lexicology and computational lexicography are often used synonymously.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information on how to use R here. For this tutorial, we need to install certain packages so that the scripts shown below can be executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code. It may take some time (between 1 and 5 minutes) to install all of the packages, so you do not need to worry if it takes a while.

# set options
options(stringsAsFactors = FALSE)     # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install packages
install.packages("dplyr")
install.packages("stringr")
install.packages("udpipe")
install.packages("tidytext")
install.packages("coop")
install.packages("cluster")
install.packages("flextable")
# install packages that are called via :: below or needed by get_sentiments("nrc")
install.packages("here")
install.packages("tidyr")
install.packages("plyr")
install.packages("textdata")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

In a next step, we load the packages.

# load packages
library(dplyr)
library(stringr)
library(udpipe)
library(tidytext)
library(coop)
library(cluster)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and once you have initiated the session by executing the code shown above, you are good to go.

1 Creating dictionaries

In a first step, we load a text. In this case, we load George Orwell’s Nineteen Eighty-Four.

text <- readLines("https://slcladal.github.io/data/orwell.txt") %>%
  paste0(collapse = " ")
# show the first 500 characters of the text
substr(text, start=1, stop=500)
## [1] "1984 George Orwell Part 1, Chapter 1 It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him. The hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It "

Next, we download a udpipe language model. In this case, we download a udpipe language model for English, but you can download udpipe language models for more than 60 languages.

# download language model
m_eng   <- udpipe::udpipe_download_model(language = "english-ewt")

In my case, I have stored this model in a folder called udpipemodels. If you have also saved the model in a folder called udpipemodels within your Rproj folder, you can load it as shown below.

# load language model from your computer after you have downloaded it once
m_eng <- udpipe_load_model(file = here::here("udpipemodels", "english-ewt-ud-2.5-191206.udpipe"))

In a next step, we apply the part-of-speech tagger to the text.

# tokenise, tag, dependency parsing
text_ann <- udpipe::udpipe_annotate(m_eng, x = text) %>%
  # convert into a data frame
  as.data.frame() %>%
  # remove columns we do not need
  dplyr::select(-sentence, -paragraph_id, -sentence_id, -feats, 
                -head_token_id, -dep_rel, -deps, -misc)
# inspect
head(text_ann, 10)
##    doc_id token_id   token   lemma  upos xpos
## 1    doc1        1    1984    1984 PROPN  NNP
## 2    doc1        2  George  George PROPN  NNP
## 3    doc1        3  Orwell  Orwell PROPN  NNP
## 4    doc1        4    Part    part PROPN  NNP
## 5    doc1        5       1       1   NUM   CD
## 6    doc1        6       ,       , PUNCT    ,
## 7    doc1        7 Chapter chapter PROPN  NNP
## 8    doc1        8       1       1   NUM   CD
## 9    doc1        1      It      it  PRON  PRP
## 10   doc1        2     was      be   AUX  VBD

We can now use the resulting table to generate a first, basic dictionary that holds information about the word form (token), the part-of-speech tag (upos), the lemmatized word type (lemma), and the frequency with which the word form is used as that part-of-speech.

# generate dictionary
text_dict_raw <- text_ann %>%
  # remove non-words
  dplyr::filter(!stringr::str_detect(token, "\\W")) %>%
  # filter out numbers
  dplyr::filter(!stringr::str_detect(token, "[0-9]")) %>%
  # group data
  dplyr::group_by(token, lemma, upos) %>%
  # summarize data
  dplyr::summarise(frequency = dplyr::n()) %>%
  # arrange by frequency
  dplyr::arrange(-frequency)
# inspect
head(text_dict_raw, 10)
## # A tibble: 10 × 4
## # Groups:   token, lemma [10]
##    token lemma upos  frequency
##    <chr> <chr> <chr>     <int>
##  1 the   the   DET        5249
##  2 of    of    ADP        2908
##  3 a     a     DET        2277
##  4 and   and   CCONJ      2064
##  5 was   be    AUX        1795
##  6 in    in    ADP        1446
##  7 to    to    PART       1336
##  8 it    it    PRON       1295
##  9 he    he    PRON       1270
## 10 had   have  AUX        1018
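The counting logic behind this dictionary can be tried out without downloading a language model. The following is a minimal sketch in base R, assuming a small hand-made annotation table (toy_ann is hypothetical and not real udpipe output) that has the same token, lemma, and upos columns as the annotated text.

```r
# hypothetical, hand-made annotation table (not real udpipe output)
toy_ann <- data.frame(
  token = c("The", "cat", "sat", ".", "the", "cats", "sat", "1984"),
  lemma = c("the", "cat", "sit", ".", "the", "cat", "sit", "1984"),
  upos  = c("DET", "NOUN", "VERB", "PUNCT", "DET", "NOUN", "VERB", "NUM"),
  stringsAsFactors = FALSE
)
# keep only tokens that contain neither non-word characters nor digits
toy_words <- toy_ann[!grepl("\\W|[0-9]", toy_ann$token), ]
# count how often each token-lemma-upos combination occurs
toy_dict <- aggregate(frequency ~ token + lemma + upos,
                      data = transform(toy_words, frequency = 1L),
                      FUN = sum)
# order by descending frequency
toy_dict <- toy_dict[order(-toy_dict$frequency), ]
toy_dict
```

Because the grouping includes the token, the capitalized The and the lower-case the end up as separate entries, just as a and A do in the dictionary based on the novel.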

The above display is ordered by frequency but it is, of course, more common to arrange dictionaries alphabetically. To do this, we can simply use the arrange function from the dplyr package as shown below.

# generate dictionary
text_dict <- text_dict_raw %>%
  # arrange alphabetically
  dplyr::arrange(token)
# inspect
head(text_dict, 10)
## # A tibble: 10 × 4
## # Groups:   token, lemma [7]
##    token     lemma     upos  frequency
##    <chr>     <chr>     <chr>     <int>
##  1 a         a         DET        2277
##  2 A         a         DET         107
##  3 A         a         NOUN          1
##  4 Aaronson  Aaronson  PROPN         8
##  5 aback     aback     ADV           1
##  6 aback     aback     NOUN          1
##  7 abandon   abandon   VERB          2
##  8 abandon   abandon   ADP           1
##  9 abandoned abandon   VERB          4
## 10 abasement abasement NOUN          1

We have now generated a basic dictionary of English but, as you can see above, there are still some errors as the part-of-speech tagging was not perfect. As such, you will still need to check and edit the results manually but you have already a rather clean dictionary based on George Orwell’s Nineteen Eighty-Four to work with.

Correcting and Extending Dictionaries

Fortunately, it is very easy in R to correct entries, i.e., changing lemmas or part-of-speech tags, and to extend entries, i.e., adding additional layers of information such as urls or examples.

We will begin to extend our dictionary by removing an erroneous entry, correcting a part-of-speech tag, and adding an additional column (called comment) in which we will add information.

# generate dictionary
text_dict_ext <- text_dict %>%
  # removing an entry
  dplyr::filter(!(lemma == "a" & upos == "NOUN")) %>%
  # editing entries
  dplyr::mutate(upos = ifelse(lemma == "aback" & upos == "NOUN", "PREP", upos)) %>%
  # adding comments 
  dplyr::mutate(comment = dplyr::case_when(lemma == "a" ~ "also an before vowels",
                                           lemma == "Aaronson" ~ "Name of someone.", 
                                           T ~ ""))
# inspect
head(text_dict_ext, 10)
## # A tibble: 10 × 5
## # Groups:   token, lemma [8]
##    token     lemma     upos  frequency comment                
##    <chr>     <chr>     <chr>     <int> <chr>                  
##  1 a         a         DET        2277 "also an before vowels"
##  2 A         a         DET         107 "also an before vowels"
##  3 Aaronson  Aaronson  PROPN         8 "Name of someone."     
##  4 aback     aback     ADV           1 ""                     
##  5 aback     aback     PREP          1 ""                     
##  6 abandon   abandon   VERB          2 ""                     
##  7 abandon   abandon   ADP           1 ""                     
##  8 abandoned abandon   VERB          4 ""                     
##  9 abasement abasement NOUN          1 ""                     
## 10 abashed   abashed   VERB          1 ""

To make it a bit more interesting but also keep this tutorial simple and straightforward, we will add information about the polarity and emotionality of the words in our dictionary. We can do this by performing a sentiment analysis on the lemmas using the tidytext package.

The tidytext package contains three sentiment dictionaries (nrc, bing, and afinn). For the present purpose, we use the nrc dictionary which represents the Word-Emotion Association Lexicon (Mohammad and Turney 2013). The Word-Emotion Association Lexicon comprises 10,170 terms, and its lexical elements are assigned scores based on ratings gathered through the crowd-sourced Amazon Mechanical Turk service. For the Word-Emotion Association Lexicon, raters were asked whether a given word was associated with one of eight emotions. The resulting associations between terms and emotions are based on 38,726 ratings from 2,216 raters who answered a sequence of questions for each word which were then fed into the emotion association rating (cf. Mohammad and Turney 2013). Each term was rated 5 times. For 85 percent of words, at least 4 raters provided identical ratings. For instance, words such as cry or tragedy are more readily associated with SADNESS while words such as happy or beautiful are indicative of JOY and words like fit or burst may indicate ANGER. This means that the sentiment analysis here allows us to investigate the expression of certain core emotions rather than merely classifying statements along the lines of a crude positive-negative distinction.

To be able to use the Word-Emotion Association Lexicon we need to add another column to our data frame called word which simply contains the lemmatized word. The reason is that the lexicon expects this column and only works if it finds a word column in the data. The code below shows how to add the emotion and polarity entries to our dictionary.

# generate dictionary
text_dict_snt <- text_dict_ext %>%
  dplyr::mutate(word = lemma) %>%
  dplyr::left_join(get_sentiments("nrc")) %>%
  dplyr::group_by(token, lemma, upos, comment) %>%
  dplyr::summarise(sentiment = paste0(sentiment, collapse = ", "))
# inspect
head(text_dict_snt, 10) 
## # A tibble: 10 × 5
## # Groups:   token, lemma, upos [10]
##    token     lemma     upos  comment                 sentiment              
##    <chr>     <chr>     <chr> <chr>                   <chr>                  
##  1 a         a         DET   "also an before vowels" NA                     
##  2 A         a         DET   "also an before vowels" NA                     
##  3 Aaronson  Aaronson  PROPN "Name of someone."      NA                     
##  4 aback     aback     ADV   ""                      NA                     
##  5 aback     aback     PREP  ""                      NA                     
##  6 abandon   abandon   ADP   ""                      fear, negative, sadness
##  7 abandon   abandon   VERB  ""                      fear, negative, sadness
##  8 abandoned abandon   VERB  ""                      fear, negative, sadness
##  9 abasement abasement NOUN  ""                      NA                     
## 10 abashed   abashed   VERB  ""                      NA
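To see why the word column is needed and how several sentiments end up in a single cell, the join can be reproduced on a toy scale. This is a minimal sketch in base R, assuming a hypothetical three-row mini-lexicon (toy_lex) that merely mimics the word and sentiment layout of the nrc lexicon; it is not the lexicon itself.

```r
# toy dictionary entries (tokens and lemmas as in the table above)
toy_dict <- data.frame(
  token = c("abandon", "abandoned", "aback"),
  lemma = c("abandon", "abandon", "aback"),
  stringsAsFactors = FALSE
)
# hypothetical mini-lexicon in a word/sentiment layout
toy_lex <- data.frame(
  word      = c("abandon", "abandon", "abandon"),
  sentiment = c("fear", "negative", "sadness"),
  stringsAsFactors = FALSE
)
# the join key must exist in both tables, hence the additional word column
toy_dict$word <- toy_dict$lemma
# left join: keep all dictionary entries, also those without a lexicon match
toy_joined <- merge(toy_dict, toy_lex, by = "word", all.x = TRUE)
# collapse all sentiments of an entry into one string; keep NA if there is no match
collapse_sentiments <- function(x) {
  if (all(is.na(x))) NA_character_ else paste(sort(x), collapse = ", ")
}
toy_snt <- aggregate(sentiment ~ token + lemma, data = toy_joined,
                     FUN = collapse_sentiments, na.action = na.pass)
toy_snt
```

The join is performed on the lemma, so the inflected form abandoned inherits the sentiments listed for abandon.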

The resulting extended dictionary now contains not only the token, the lemma, and the pos-tag but also the sentiment from the Word-Emotion Association Lexicon.

Generating dictionaries for other languages

As mentioned above, the procedure for generating dictionaries can easily be applied to languages other than English. If you want to follow exactly the procedure described above, then the availability of a udpipe language model is the limiting factor; as noted above, such models exist for more than 60 languages, including German, French, Italian, Spanish, and Dutch. If a part-of-speech tagged text in another language is already available to you, and you do not require udpipe for the part-of-speech tagging, then you can skip the code chunk that is related to the tagging and adapt the procedure described above to virtually any language.

We will now briefly create a German dictionary based on a subsection of the fairy tales collected by the brothers Grimm to show how the above procedure can be applied to a language other than English. In a first step, we load a German text into R.

grimm <- readLines("https://slcladal.github.io/data/GrimmsFairytales.txt",
                   encoding = "latin1") %>%
  paste0(collapse = " ")
# show the first 200 characters of the text
substr(grimm, start=1, stop=200)
## [1] "Der Froschkönig oder der eiserne Heinrich  Ein Märchen der Brüder Grimm Brüder Grimm  In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön; aber die"

Next, we download a udpipe language model. In this case, we download a udpipe language model for German, but you can download udpipe language models for more than 60 languages.

# download language model (the gsd model is the one that is loaded below)
udpipe::udpipe_download_model(language = "german-gsd")

In my case, I have stored this model in a folder called udpipemodels. If you have also saved the model in a folder called udpipemodels within your Rproj folder, you can load it as shown below.

# load language model from your computer after you have downloaded it once
m_ger <- udpipe_load_model(file = here::here("udpipemodels",
                                             "german-gsd-ud-2.5-191206.udpipe"))

In a next step, we generate the dictionary based on the Brothers Grimm fairy tales. We go through the same steps as for the English dictionary and collapse all the steps into a single code block.

# tokenise, tag, dependency parsing
grimm_ann <- udpipe::udpipe_annotate(m_ger, x = grimm) %>%
  # convert into a data frame
  as.data.frame() %>%
  # remove non-words
  dplyr::filter(!stringr::str_detect(token, "\\W")) %>%
  # filter out numbers
  dplyr::filter(!stringr::str_detect(token, "[0-9]")) %>%
  dplyr::group_by(token, lemma, upos) %>%
  dplyr::summarise(frequency = dplyr::n()) %>%
  dplyr::arrange(lemma)
# inspect
head(grimm_ann, 10)
## # A tibble: 10 × 4
## # Groups:   token, lemma [8]
##    token    lemma    upos  frequency
##    <chr>    <chr>    <chr>     <int>
##  1 A        A        NOUN          1
##  2 ab       ab       ADP          12
##  3 abends   abend    ADV           2
##  4 Abend    Abend    NOUN          3
##  5 abends   abends   ADV           1
##  6 aber     aber     ADJ           1
##  7 aber     aber     ADV          56
##  8 aber     aber     CCONJ        32
##  9 Aber     aber     CCONJ        16
## 10 abfallen abfallen VERB          1

As with the English dictionary, we have created a customized German dictionary based on a subsample of the Brothers Grimm fairy tales that holds the word form (token), the lemmatized word type (lemma), the part-of-speech tag (upos), and the frequency with which a word form occurs as that part-of-speech in the data (frequency).

2 Finding synonyms: creating a thesaurus

Another task that is quite common in lexicography is to determine if words share some form of relationship, such as whether they are synonyms or antonyms. In computational linguistics, this is commonly determined based on the collocational profiles of words. These collocational profiles are also called word vectors or word embeddings, and approaches which determine semantic similarity based on collocational profiles or word embeddings are called distributional approaches (or distributional semantics). The basic assumption of distributional approaches is that words that occur in the same contexts, and therefore have similar collocational profiles, are also semantically similar. In fact, various packages, such as qdap or wordnet, already provide synonyms for terms (all of which are based on similar collocational profiles) but we would like to determine if words are similar without knowing it in advance.

In this example, we want to determine if two degree adverbs (such as very, really, so, completely, totally, amazingly, etc.) are synonymous and can therefore be exchanged without changing the meaning of the sentence (or, at least, not changing it dramatically). This is relevant in lexicography as such terms can then be linked to each other and inform readers that these words are interchangeable.

As a first step, we load the data which contains three columns:

  • one column holding the degree adverbs which is called pint

  • one column called adjs holding the adjectives that the degree adverbs have modified

  • one column called remove which contains the word keep and which we will remove as it is not relevant for this tutorial

When loading the data, we

  • remove the remove column

  • rename the pint column as degree_adverb

  • rename the adjs column as adjective

  • filter out all instances where the degree adverb column has the value 0 (which means that the adjective was not modified)

  • remove instances where well functions as a degree adverb (because it behaves rather differently from other degree adverbs)

# load data
degree_adverbs <- base::readRDS(url("https://slcladal.github.io/data/dad.rda", "rb")) %>%
  dplyr::select(-remove) %>%
  dplyr::rename(degree_adverb = pint,
                adjective = adjs) %>%
  dplyr::filter(degree_adverb != "0",
                degree_adverb != "well")
# inspect
head(degree_adverbs, 10)
##    degree_adverb adjective
## 1           real       bad
## 2         really      nice
## 3           very      good
## 4         really     early
## 5         really       bad
## 6         really       bad
## 7             so      long
## 8         really wonderful
## 9         pretty      good
## 10        really      easy

In a next step, we create a matrix from this data frame which maps how often a given amplifier co-occurred with a given adjective. In text mining, this format is called a term-document matrix or tdm (which is a transposed document-term matrix or dtm).

# tabulate data (create term-document matrix)
tdm <- ftable(degree_adverbs$adjective, degree_adverbs$degree_adverb)
# extract amplifiers and adjectives 
amplifiers <- as.vector(unlist(attr(tdm, "col.vars")[1]))
adjectives <- as.vector(unlist(attr(tdm, "row.vars")[1]))
# attach row and column names to tdm
rownames(tdm) <- adjectives
colnames(tdm) <- amplifiers
# inspect data
tdm[1:5, 1:5]
##           completely extremely pretty real really
## able               0         1      0    0      0
## actual             0         0      0    1      0
## amazing            0         0      0    0      4
## available          0         0      0    0      1
## bad                0         0      1    2      3

In a next step, we extract the expected values of the co-occurrences if the amplifiers were distributed homogeneously, calculate the Pointwise Mutual Information (PMI) scores, and use these to calculate the Positive Pointwise Mutual Information (PPMI) scores. According to Levshina (2015, 327), referring to Bullinaria and Levy (2007), PPMI performs better than PMI as negative values are replaced with zeros. Finally, we calculate the cosine similarity which will form the basis for the subsequent clustering.

# compute expected values
tdm.exp <- chisq.test(tdm)$expected
# calculate PMI and PPMI
PMI <- log2(tdm/tdm.exp)
PPMI <- ifelse(PMI < 0, 0, PMI)
# calculate cosine similarity
cosinesimilarity <- cosine(PPMI)
# inspect cosine values
cosinesimilarity[1:5, 1:5]
##            completely   extremely      pretty       real      really
## completely 1.00000000 0.204188725 0.000000000 0.05304354 0.126668434
## extremely  0.20418873 1.000000000 0.007319316 0.00000000 0.004235346
## pretty     0.00000000 0.007319316 1.000000000 0.09441299 0.062323271
## real       0.05304354 0.000000000 0.094412995 1.00000000 0.131957473
## really     0.12666843 0.004235346 0.062323271 0.13195747 1.000000000
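To make the individual steps more transparent, the same quantities can be computed by hand on a small example. This is a sketch in base R under stated assumptions: the counts in toy are invented, the expected values are computed directly as row total times column total divided by the grand total (which is what chisq.test reports as expected), and the cosine is computed with crossprod rather than with the cosine function from the coop package.

```r
# hypothetical co-occurrence counts: adjectives (rows) x amplifiers (columns)
toy <- matrix(c(4, 0, 1,
                3, 1, 0,
                0, 5, 2,
                1, 4, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("bad", "good", "nice", "odd"),
                              c("really", "very", "so")))
# expected counts under independence: row total * column total / grand total
toy_exp <- outer(rowSums(toy), colSums(toy)) / sum(toy)
# PMI: log2 of observed over expected (zero counts yield -Inf)
toy_pmi <- log2(toy / toy_exp)
# PPMI: negative values (including -Inf) are replaced with 0
toy_ppmi <- pmax(toy_pmi, 0)
# cosine similarity between columns: dot products divided by the vector lengths
norms <- sqrt(colSums(toy_ppmi^2))
toy_cos <- crossprod(toy_ppmi) / outer(norms, norms)
round(toy_cos, 3)
```

In this toy example, really and very never share an adjective with a positive PPMI value, so their cosine similarity is 0, whereas very and so overlap on nice and odd and therefore receive a positive similarity.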

As we have now obtained a similarity measure, we can go ahead and perform a cluster analysis on these similarity values. To do so, we first extract the maximum value in the similarity matrix that is not 1, as we will use it to create a distance matrix. While we could also have simply subtracted the cosine similarity values from 1 to convert the similarity matrix into a distance matrix, we follow the procedure proposed by Levshina (2015).

# find max value that is not 1
cosinesimilarity.test <- apply(cosinesimilarity, 1, function(x){
  x <- ifelse(x == 1, 0, x) } )
maxval <- max(cosinesimilarity.test)
# create distance matrix
amplifier.dist <- 1 - (cosinesimilarity/maxval)
clustd <- as.dist(amplifier.dist)
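The difference between the two ways of converting similarities into distances can be illustrated on a small example. The following sketch assumes an invented 3 x 3 similarity matrix (toy_sim); only the two conversion formulas are taken from the text above.

```r
# hypothetical cosine similarities between three amplifiers
words <- c("completely", "extremely", "pretty")
toy_sim <- matrix(c(1.0, 0.2, 0.0,
                    0.2, 1.0, 0.1,
                    0.0, 0.1, 1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(words, words))
# option 1: plain conversion, distance = 1 - similarity
dist_plain <- as.dist(1 - toy_sim)
# option 2: rescale by the largest similarity that is not on the diagonal
toy_max <- max(toy_sim[upper.tri(toy_sim)])
dist_scaled <- as.dist(1 - toy_sim / toy_max)
dist_plain
dist_scaled
```

With the rescaling, the most similar pair receives a distance of exactly 0 and the distances spread over the whole range from 0 to 1. The diagonal of the rescaled matrix becomes negative, but this has no effect because as.dist only uses the lower triangle.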

In a next step, we visualize the results of the semantic vector space model as a dendrogram.

# create cluster object
cd <- hclust(clustd, method="ward.D")    
# plot cluster object
plot(cd, main = "", sub = "", yaxt = "n", ylab = "", xlab = "", cex = .8)

The clustering solution shows that, as expected, completely, extremely, and totally form a separate cluster from all other amplifiers, while being similar to each other and thus interchangeable with each other. In addition, real and really form a cluster together. The clustering of very, pretty, so, really, and real suggests that these amplifiers are more or less interchangeable with each other but not with totally, completely, and extremely.

To extract synonyms automatically, we can use the cosine similarity matrix that we generated before. This is what we need to do:

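  • generate a column called word
  • replace the perfect similarity values of the diagonal with 0
  • look up the highest similarity value in each row, i.e. the word that has the lowest distance to a given word
  • create a vector which holds those words (the synonym candidates).

syntb <- cosinesimilarity %>%
  as.data.frame() %>%
  # add the word that each row represents
  dplyr::mutate(word = colnames(cosinesimilarity)) %>%
  # replace the perfect self-similarity (1) with 0 in all numeric columns
  dplyr::mutate(dplyr::across(where(is.numeric), ~ replace(.x, .x == 1, 0))) %>%
  # per row, pick the name of the column with the highest similarity
  dplyr::mutate(synonym = colnames(.)[apply(dplyr::select(., -word), 1, which.max)]) %>%
  dplyr::select(word, synonym)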
syntb
##                  word    synonym
## completely completely  extremely
## extremely   extremely completely
## pretty         pretty       real
## real             real     really
## really         really       real
## so                 so       real
## totally       totally completely
## very             very         so

NOTE

Remember that this is only a tutorial! A proper study would have to take the syntactic context into account because, while we can say This really great tutorial helped me a lot, we probably would not say This so great tutorial helped me a lot. This is because so is syntactically more restricted and is strongly disfavored in attributive contexts. Therefore, the syntactic context would have to be considered in a more thorough study.

There are many more useful methods for identifying semantic similarity. A very useful method (which we have implemented here, but only superficially) is Semantic Vector Space Modeling. If you want to know more about this, this tutorial by Gede Primahadi Wijaya Rajeg, Karlina Denistia, and Simon Musgrave (Rajeg, Denistia, and Musgrave 2019) is highly recommended and will give you a better understanding of semantic vector space models, but the above should suffice to get you started.

3 Creating bilingual dictionaries

Dictionaries commonly contain information about elements. Bilingual or translation dictionaries represent a sub-category of dictionaries that provide a specific type of information about a given word: the translation of that word in another language. In principle, generating translation dictionaries is relatively easy and straightforward. However, not only is the devil hidden in the details but the generation of data-driven translation dictionaries also requires a substantial data set consisting of sentences and their translations. This is often quite tricky as well-aligned translations are unfortunately, and unexpectedly, rather hard to come by.

Despite these issues, if you have access to clean and well-aligned parallel multilingual data, then you simply need to check which correlation between the word in language A and the words in language B is the highest and you have a likely candidate for its translation. The same procedure can be extended to generate multilingual dictionaries. Problems arise due to grammatical differences between languages, idiomatic expressions, homonymy and polysemy, as well as due to word class issues. The latter, word class issues, can be solved by part-of-speech tagging and then only considering words that belong to the same (or realistically similar) parts of speech. The other issues can also be solved but require substantial amounts of (annotated) data.

To explore how to generate a multilingual lexicon, we load a sample of English sentences and their German translations.

# load translations
translations <- readLines("https://slcladal.github.io/data/translation.txt",
                          encoding = "UTF-8", skipNul = T)

In a next step, we generate separate tables which hold the German and English sentences. However, the sentences and their translations are identified by an identification number (id) so that we keep the information about which sentence is linked to which translation.

# german sentences
german <- str_remove_all(translations, " — .*") %>%
  str_remove_all(., "[:punct:]")
# english sentences
english <- str_remove_all(translations, ".* — ") %>%
  str_remove_all(., "[:punct:]")
# sentence id
sentence <- 1:length(german)
# combine into table
germantb <- data.frame(sentence, german)
# sentence id
sentence <- 1:length(english)
# combine into table
englishtb <- data.frame(sentence, english)

We now unnest the tokens (split the sentences into words) and subsequently add the translations which we again unnest. The resulting table consists of two columns holding German and English words. The relevant point here is that each German word is linked with each English word that occurs in the translated sentence.

library(plyr)
# tokenize by sentence: german
transtb <- germantb %>%
  unnest_tokens(word, german) %>%
  # add english data
  plyr::join(., englishtb, by = "sentence") %>%
  unnest_tokens(trans, english) %>%
  dplyr::rename(german = word,
                english = trans) %>%
  dplyr::select(german, english) %>%
  dplyr::mutate(german = factor(german),
                english = factor(english))

Based on this table, we can now generate a term-document matrix which shows how frequently each word co-occurred in the translation of any of the sentences. For instance, the German word alles occurred one time in a translation of a sentence which contained the English word all.

# tabulate data (German words as rows, English words as columns)
tdm <- ftable(transtb$german, transtb$english)
# extract the German (row) and English (column) words
german <- as.vector(unlist(attr(tdm, "row.vars")[1]))
english <- as.vector(unlist(attr(tdm, "col.vars")[1]))
# attach row and column names to tdm
rownames(tdm) <- german
colnames(tdm) <- english
# inspect data
tdm[1:10, 1:10]
##          a accident all am ambulance an and any anything are
## ab       0        0   0  0         0  0   0   0        0   0
## abend    0        0   0  0         0  0   0   0        0   0
## allem    0        0   0  0         0  0   0   0        0   0
## alles    0        0   1  0         0  0   0   0        0   0
## am       0        0   0  0         0  0   0   0        0   0
## an       0        0   0  0         0  0   0   0        0   0
## anderen  1        0   0  0         0  0   0   0        0   0
## apotheke 1        0   0  1         0  0   0   0        0   0
## arzt     1        0   0  0         0  0   0   0        0   0
## auch     3        0   0  0         0  0   0   0        1   0

Now, we reformat this co-occurrence matrix so that we have the frequency information that is necessary for setting up 2x2 contingency tables which we will use to calculate the co-occurrence strength between each word and its potential translation.

coocdf <- as.data.frame(as.matrix(tdm))
cooctb <- coocdf %>%
  dplyr::mutate(German = rownames(coocdf)) %>%
  tidyr::gather(English, TermCoocFreq,
                colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]) %>%
  dplyr::mutate(German = factor(German),
                English = factor(English)) %>%
  dplyr::mutate(AllFreq = sum(TermCoocFreq)) %>%
  dplyr::group_by(German) %>%
  dplyr::mutate(TermFreq = sum(TermCoocFreq)) %>%
  dplyr::ungroup(German) %>%
  dplyr::group_by(English) %>%
  dplyr::mutate(CoocFreq = sum(TermCoocFreq)) %>%
  dplyr::arrange(German) %>%
  dplyr::mutate(a = TermCoocFreq,
                b = TermFreq - a,
                c = CoocFreq - a, 
                d = AllFreq - (a + b + c)) %>%
  dplyr::mutate(NRows = nrow(coocdf))%>%
  dplyr::filter(TermCoocFreq > 0)

In a final step, we extract those potential translations that correlate most strongly with each given term. The results then form a list of words and their most likely translation.

translationtb <- cooctb  %>%
  dplyr::rowwise() %>%
  dplyr::mutate(p = round(as.vector(unlist(fisher.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1])), 5), 
                x2 = round(as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1])), 3)) %>%
  dplyr::mutate(phi = round(sqrt((x2/(a + b + c + d))), 3),
                expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$expected[1]))) %>%
  dplyr::filter(TermCoocFreq > expected) %>%
  dplyr::arrange(-phi) %>%
  dplyr::select(-AllFreq, -a, -b, -c, -d, -NRows, -expected)
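The association measure used here, phi, can be checked by hand for a single 2x2 table. This is a sketch with invented cell frequencies; the cell labels a, b, c, and d follow the code above. As an assumption of this sketch, the chi-squared statistic is computed without the continuity correction (correct = FALSE) so that it matches the closed-form expression for phi exactly, whereas chisq.test applies the correction to 2x2 tables by default.

```r
# hypothetical cell frequencies for one German-English word pair
obs <- c(a = 8, b = 2, c = 3, d = 87)
tab <- matrix(obs, ncol = 2, byrow = TRUE)
# chi-squared statistic without continuity correction (assumption of this sketch)
x2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
# phi: square root of chi-squared divided by the number of observations
phi <- sqrt(x2 / sum(tab))
# closed form of phi for a 2x2 table, used here as a cross-check
phi_direct <- (obs[["a"]] * obs[["d"]] - obs[["b"]] * obs[["c"]]) /
  sqrt((obs[["a"]] + obs[["b"]]) * (obs[["c"]] + obs[["d"]]) *
       (obs[["a"]] + obs[["c"]]) * (obs[["b"]] + obs[["d"]]))
# expected co-occurrence frequency of the pair under independence
expected <- (obs[["a"]] + obs[["b"]]) * (obs[["a"]] + obs[["c"]]) / sum(tab)
round(c(phi = phi, phi_direct = phi_direct, expected = expected), 3)
```

Since the observed co-occurrence frequency (8) is far higher than the expected one, this word pair would pass the filter on observed versus expected frequencies and rank high as a translation candidate.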

The results show that even using the very limited data base can produce some very reasonable results. In fact, based on the data that we used here, the first translations appear to be very sensible, but the mismatches also show that more data is required to disambiguate potential translations.

While this method still requires manual correction, it is a very handy and useful tool for generating custom bilingual dictionaries that can be extended to any set of languages as long as these languages can be represented as distinct words and as long as parallel data is available.

4 Going further: crowd-sourced dictionaries with R and Git

While it would go beyond the scope of this tutorial, it should be noted that the approach for creating dictionaries can be applied to crowd-sourced dictionaries. To do this, you could, e.g., upload your dictionary to a Git repository on a hosting platform such as GitHub or GitLab, which would then allow everybody with an account on either of these platforms to add content to the dictionary.

To add to the dictionary, contributors would simply fork the dictionary repository, make their changes, and then ask for these changes to be merged into the original repository (a pull request on GitHub, a merge request on GitLab). The quality of the data would remain under the control of the owner of the original repository, because they can decide on a case-by-case basis which changes to accept. In addition, because Git is a version control system, the owner can return to a previous version if they think they accepted a change (merge) in error.
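Because Git tracks changes to text files line by line, it helps to store the dictionary as a sorted plain-text table so that each contribution shows up as a small, readable change. The following base R sketch illustrates one way to combine an existing dictionary with contributed entries before committing. The data frames, column names, and entries are hypothetical, and the file is written to a temporary location rather than to a real repository.

```r
# Hypothetical existing dictionary and contributed entries
dictionary <- data.frame(
  word       = c("cat", "dog"),
  definition = c("a small domesticated feline", "a domesticated canine"),
  stringsAsFactors = FALSE
)
contribution <- data.frame(
  word       = c("dog", "owl"),
  definition = c("a domesticated canine", "a nocturnal bird of prey"),
  stringsAsFactors = FALSE
)

# Combine both tables, drop exact duplicates, and sort alphabetically
merged <- unique(rbind(dictionary, contribution))
merged <- merged[order(merged$word), ]
rownames(merged) <- NULL

# Write a tab-separated text file; in a real project this would be a file
# tracked in the dictionary repository instead of a temporary file
dict_file <- tempfile(fileext = ".tsv")
write.table(merged, dict_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Entries that share a headword but differ in their definition are kept as separate rows here, which leaves the decision about conflicting contributions to the repository owner.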

This option is particularly interesting for the approach to creating dictionaries presented here because RStudio has an integrated and very easy to use interface to Git (see, e.g., here and here).

We have reached the end of this tutorial and you now know how to create dictionaries in R, how to find synonyms by determining semantic similarity, and how to generate bilingual dictionaries from parallel data.



Citation & Session Info

Schweinberger, Martin. 2022. Lexicography with R. Brisbane: The University of Queensland. url: https://slcladal.github.io/lex.html (Version 2022.08.16).

@manual{schweinberger2022lex,
  author = {Schweinberger, Martin},
  title = {Lexicography with R},
  note = {https://slcladal.github.io/lex.html},
  year = {2022},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.08.16}
}
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] plyr_1.8.7      flextable_0.7.0 cluster_2.1.3   coop_0.6-3     
## [5] tidytext_0.3.3  udpipe_0.8.9    stringr_1.4.0   dplyr_1.0.9    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3      here_1.0.1        lattice_0.20-45   tidyr_1.2.0      
##  [5] assertthat_0.2.1  rprojroot_2.0.3   digest_0.6.29     utf8_1.2.2       
##  [9] R6_2.5.1          evaluate_0.15     highr_0.9         pillar_1.7.0     
## [13] gdtools_0.2.4     rlang_1.0.2       uuid_1.1-0        rstudioapi_0.13  
## [17] data.table_1.14.2 textdata_0.4.2    jquerylib_0.1.4   Matrix_1.4-1     
## [21] klippy_0.0.0.9500 rmarkdown_2.14    readr_2.1.2       compiler_4.2.1   
## [25] janeaustenr_0.1.5 xfun_0.30         pkgconfig_2.0.3   systemfonts_1.0.4
## [29] base64enc_0.1-3   htmltools_0.5.2   tidyselect_1.1.2  tibble_3.1.7     
## [33] fansi_1.0.3       crayon_1.5.1      tzdb_0.3.0        rappdirs_0.3.3   
## [37] SnowballC_0.7.0   grid_4.2.1        jsonlite_1.8.0    lifecycle_1.0.1  
## [41] DBI_1.1.2         magrittr_2.0.3    tokenizers_0.2.1  zip_2.2.0        
## [45] cli_3.3.0         stringi_1.7.6     renv_0.15.4       fs_1.5.2         
## [49] xml2_1.3.3        bslib_0.3.1       ellipsis_0.3.2    generics_0.1.2   
## [53] vctrs_0.4.1       tools_4.2.1       glue_1.6.2        officer_0.4.2    
## [57] purrr_0.3.4       hms_1.1.1         fastmap_1.1.0     yaml_2.3.5       
## [61] knitr_1.39        sass_0.4.1

References



Agnes, Michael, Jonathan L Goldman, and Katherine Soltis. 2002. Webster’s New World Compact Desk Dictionary and Style Guide. Hungry Minds.
Amsler, Robert Alfred. 1981. The Structure of the Merriam-Webster Pocket Dictionary. Austin, TX: The University of Texas at Austin.
Bullinaria, J. A., and J. P. Levy. 2007. “Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study.” Behavior Research Methods 39: 510–26.
Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.
Mohammad, Saif M, and Peter D Turney. 2013. “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65.
Rajeg, Gede Primahadi Wijaya, Karlina Denistia, and Simon Musgrave. 2019. “R Markdown Notebook for Vector Space Model and the Usage Patterns of Indonesian Denominal Verbs.” https://doi.org/10.6084/m9.figshare.9970205. https://figshare.com/articles/R%5FMarkdown%5FNotebook%5Ffor%5Fi%5FVector%5Fspace%5Fmodel%5Fand%5Fthe%5Fusage%5Fpatterns%5Fof%5FIndonesian%5Fdenominal%5Fverbs%5Fi%5F/9970205.
Steiner, Roger J. 1985. “Dictionaries. The Art and Craft of Lexicography.” Dictionaries: Journal of the Dictionary Society of North America 7 (1): 294–300.