Comparing Methods for Conceptual Maps

Author

Gerold Schneider

Introduction

Conceptual maps are a distributional semantics method that gives a bird’s-eye view of a large dataset. This showcase compares several different approaches to building and visualising conceptual maps from the same corpus, allowing us to assess what each method reveals — and what it obscures.

In this tutorial, we will:

  1. Train a word2vec semantic space on the COOEE corpus (Australian historical letters)
  2. Test the semantic space with nearest-neighbour queries
  3. Build conceptual maps using six different layout methods:
    • t-SNE — non-linear dimensionality reduction (interactive via plotly)
    • igraph with Fruchterman-Reingold — force-directed graph layout
    • igraph with DRL — a scalable force-directed algorithm
    • ForceAtlas2 — an animated force-directed algorithm popular in Gephi
    • UMAP — non-linear dimensionality reduction with strong local structure
    • Textplot + GML — a pre-computed graph imported from an external tool

Related Tutorial

This showcase is a companion to the main Conceptual Maps tutorial, which introduces the core concepts. The focus here is on comparing methods: understanding what each layout algorithm reveals, and which works best for which purpose.

Prerequisite Tutorials

Before working through this tutorial, we recommend familiarity with the main Conceptual Maps tutorial, which introduces the core concepts of distributional semantics and word2vec.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Train and load a word2vec model using wordVectors
  2. Query nearest neighbours in a semantic space
  3. Build a word–word cosine similarity matrix and convert it to an igraph object
  4. Visualise a semantic network with six different layout algorithms
  5. Critically compare the strengths and weaknesses of each layout method

Citation

Schneider, Gerold. 2026. Comparing Methods for Conceptual Maps. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/conceptualmaps_showcase/conceptualmaps_showcase.html (Version 2026.05.01).


The COOEE Corpus

Section Overview

What you will learn: What the COOEE corpus is; how to download it; and what the period labels embedded in the text mean

The COOEE corpus (Corpus of Oz Early English) consists of Australian English letters written between 1788 and 1900. It is an ideal corpus for exploring distributional semantics across historical periods because:

  • It is large enough to train a meaningful word2vec model (~10 MB of text)
  • It contains temporal labels embedded directly in the text, allowing semantic queries about specific periods
  • The content reflects the dramatic social changes of colonial Australia

The corpus has been prepared with period labels embedded in the running text as pseudo-words:

Label        Period
periodone    1788–1825
periodtwo    1826–1850
periodthree  1851–1875
periodfour   1876–1900

This means we can query closest_to(training, "periodone", 30) and receive the words most strongly associated with that historical period.

Downloading the corpus

Data File Required

The COOEE corpus file (ALL_byperiod_nomarkup.txt) must be present in tutorials/conceptualmaps_showcase/data/ before running any of the code in this tutorial. Download it using the code below on first use.

Code
# Create the data folder if it does not exist
dir.create("tutorials/conceptualmaps_showcase/data", recursive = TRUE, showWarnings = FALSE)

# Download the COOEE corpus (run once)
download.file(
  url      = "https://ladal.edu.au/tutorials/conceptualmaps_showcase/data/ALL_byperiod_nomarkup.txt",
  destfile = "tutorials/conceptualmaps_showcase/data/ALL_byperiod_nomarkup.txt",
  mode     = "wb"
)
Code
# Path to the corpus file — adjust if your project structure differs
corpus_file <- here::here("tutorials/conceptualmaps_showcase/data/ALL_byperiod_nomarkup.txt")

Setup

Installing packages

GitHub-only packages and igraph compatibility

wordVectors and ForceAtlas2 are not on CRAN — install both from GitHub using remotes.

ForceAtlas2 uses igraph::get.adjacency() internally, which was deprecated in igraph ≥ 1.3 (replaced by as_adjacency_matrix()). This may produce deprecation warnings on recent igraph versions but should still run. If you encounter errors in the ForceAtlas2 sections, check the ForceAtlas2 GitHub issues for a patched version.

Code
# CRAN packages
install.packages(c(
  "igraph",        # graph construction and layout algorithms
  "tidyverse",     # data manipulation
  "tidytext",      # stopword lists
  "ggplot2",       # plotting
  "ggrepel",       # non-overlapping text labels
  "reshape2",      # data reshaping
  "Rtsne",         # t-SNE dimensionality reduction
  "plotly",        # interactive plots
  "htmlwidgets",   # save interactive HTML widgets
  "scales",        # rescaling values
  "tsne",          # t-SNE dimensionality reduction
  "uwot"           # UMAP dimensionality reduction
))

# GitHub-only packages
remotes::install_github("bmschmidt/wordVectors")
remotes::install_github("analyxcompany/ForceAtlas2")

Loading packages

Code
library(igraph)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(ggrepel)
library(reshape2)
library(Rtsne)
library(plotly)
library(htmlwidgets)
library(scales)
library(uwot)
library(wordVectors)
library(ForceAtlas2)

Training the Semantic Space

Section Overview

What you will learn: How to use wordVectors to prepare and train a word2vec model; the effect of window size on the resulting semantic space; and how to load a pre-trained model to avoid re-training

The word2vec algorithm learns a vector representation for every word in the corpus such that words appearing in similar contexts receive similar vectors. The key hyperparameter is the window size: how many words on either side of the target word are considered context. Larger windows capture deeper, more topical semantics; smaller windows capture more syntactic and collocational relationships.
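
To make the window concept concrete, here is a minimal base-R sketch (the sentence and index values are invented for illustration) showing which context words a window of size 2 sees around one target word:

```r
# Toy illustration of a context window (base R only; sentence is invented)
tokens <- c("the", "convicts", "were", "sent", "to", "the", "colony")
target <- 4   # index of the target word, "sent"
window <- 2   # number of context words on each side

lo <- max(1, target - window)
hi <- min(length(tokens), target + window)
context <- tokens[setdiff(lo:hi, target)]
context
# "convicts" "were" "to" "the"
```

With window = 2, "sent" is trained against only its four surrounding words; a window of 10, as used below, pulls in much more of the sentence and hence more topical context.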

Preparing the corpus

The prep_word2vec() function tokenises and lowercases the raw text, producing a cleaned version ready for training:

Code
prep_word2vec(
  origin      = corpus_file,
  destination = here::here("tutorials/conceptualmaps_showcase/data/ALL_byperiod_nomarkup_out.txt"),
  lowercase   = TRUE
)

Training (run once)

Training Takes Time

Training with window size 10 takes approximately 4 minutes. Run this block once and then load the saved .bin file instead. The force = TRUE argument overwrites any existing model.

Code
# Window size 10 — recommended default (4 minutes)
training <- train_word2vec(
  train_file  = here::here("tutorials/conceptualmaps_showcase/data/ALL_byperiod_nomarkup_out.txt"),
  output_file = here::here("tutorials/conceptualmaps_showcase/data/ALL_byperiod_nomarkup_w10.bin"),
  threads     = 4,
  vectors     = 200,
  window      = 10,
  force       = TRUE
)

# Uncomment to try larger windows (slower, deeper semantics):
# window 20 (~10 min):
# training <- train_word2vec(..., output_file = "...w20.bin", window = 20)
# window 50 (~20 min):
# training <- train_word2vec(..., output_file = "...w50.bin", window = 50)

Loading the trained model

After training once, always load the saved .bin file directly:

Code
# Load pre-trained model (fast — no re-training)
model_file <- here::here("tutorials/conceptualmaps_showcase/data/ALL_byperiod_nomarkup_w10.bin")

training <- read.binary.vectors(
  filename      = model_file,
  nrows         = Inf,
  cols          = "All",
  rowname_list  = NULL,
  rowname_regexp = NULL
)

Testing the Semantic Space

Section Overview

What you will learn: How to query nearest neighbours in a word2vec model; and what the COOEE semantic space reveals about the vocabulary of early Australian English

The closest_to() function returns the words most similar to a query term according to cosine similarity in the vector space. These results give us a way to sanity-check the model before building maps. The examples here come from a run with window size 10.
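
Cosine similarity itself is straightforward to verify by hand. A minimal base-R sketch with invented toy vectors:

```r
# Cosine similarity between toy vectors (base R only)
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

v1 <- c(1, 2, 3)
v2 <- c(2, 4, 6)    # same direction as v1 -> similarity 1
v3 <- c(3, 0, -1)   # orthogonal to v1   -> similarity 0

cosine_sim(v1, v2)  # 1
cosine_sim(v1, v3)  # 0
```

Because cosine similarity depends only on direction, v2 (a scaled copy of v1) scores a perfect 1 even though its magnitude differs. This is the behaviour that lets word2vec treat words with proportionally similar context profiles as near neighbours.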

Content words

Code
closest_to(training, "convict", 30)

The first settlers were convicts. A mutiny during transport was the greatest danger for the captain and the midshipman; surgeons were stretched. The word female is more surprising — it appears because frequent phrases like “male and female convicts” make female a near-neighbour of convict, even with a small window.

Code
closest_to(training, "letter", 30)

COOEE is a corpus of letters; this query shows what letter is associated with — delivery, postage, and the act of writing and receiving.

Code
closest_to(training, "dear", 30)

Dear is primarily used to address recipients and to formally express affection.

Code
closest_to(training, "england", 30)

The associations of england include expected relatives such as scotland, but also colony and sailed — reflecting the long sea voyage that separated the colonists from home.

Code
closest_to(training, "australia", 30)
Code
closest_to(training, "government", 30)

Period labels

One of the most interesting features of the COOEE corpus is its embedded period labels. Querying these pseudo-words reveals the dominant themes of each historical period.

Code
closest_to(training, "periodone", 30)

Period 1 (1788–1825) — the earliest settlement period — returns years falling within the period, and names of people prominent in those years. For example, Frederick Garling (1775–1848) was one of the first solicitors admitted in Australia (Wikipedia).

Code
closest_to(training, "periodtwo", 30)

Period 2 (1826–1850) again returns years and personal names, reflecting the continued growth of the colony and its social structures.

Code
closest_to(training, "periodthree", 30)

Period 3 (1851–1875) — the gold rush era and expansion inland. Person names dominate; to see what is distinctive about this period compared to others, we would need to dig deeper.

Code
closest_to(training, "periodfour", 30)

Period 4 (1876–1900) foreshadows Australia’s federation. Among the top neighbours of periodfour we find federal, parliament, speaker and senator — the Australian Parliament was founded in 1901, and this historic event is already visible in the letters of the preceding decades.


Building the Similarity Matrix and Graph

Section Overview

What you will learn: How to construct a word–word cosine similarity matrix from a word2vec model; how to convert it to a long-form data frame; how to filter it; and how to build an igraph object that can be visualised with multiple layout algorithms

Selecting words

We take the 1,000 most frequent words in the model as our vocabulary for the maps. You can experiment with this number — 500 to 1,000 is a good range. More words make the graph richer but slower to compute and harder to read.

Code
word_list <- rownames(training)[1:1000]

Subsetting the model

Code
sub_model <- training[word_list, ]

Computing cosine similarities

We compute the full word–word cosine similarity matrix. This is a 1,000 × 1,000 matrix where every cell contains the cosine similarity between two words.

Code
similarity_matrix <- cosineSimilarity(sub_model, sub_model)

# Inspect the top-left corner as a sanity check
similarity_matrix[1:10, 1:10]

Saving the matrix

It is good practice to save this intermediate result so you can reload it without recomputing:

Code
write.csv(
  as.data.frame(similarity_matrix),
  here::here("tutorials/conceptualmaps_showcase/data/word_similarity_matrix_top1000_w10.csv")
)

Converting to long form

Graph tools and igraph expect an edge list (long form) rather than a square matrix. We convert using as.table():

Code
similarity_df <- as.data.frame(as.table(similarity_matrix))
colnames(similarity_df) <- c("word1", "word2", "similarity")
head(similarity_df)

Filtering

We apply three filters:

  1. Remove stopwords (using the tidytext stopword list)
  2. Remove very short words (3 characters or fewer)
  3. Keep only pairs with cosine similarity above 0.25, excluding self-similarities
Code
# Use the tidytext stopword list (loaded above via library(tidytext))
eng_stopwords <- tidytext::stop_words$word

# Remove stopwords
similarity_df <- similarity_df |>
  filter(
    !word1 %in% eng_stopwords,
    !word2 %in% eng_stopwords
  )

# Remove short words
similarity_df <- similarity_df |>
  filter(
    nchar(as.character(word1)) > 3,
    nchar(as.character(word2)) > 3
  )

# Keep only strong similarities, exclude self-pairs
similarity_df <- subset(
  similarity_df,
  similarity > 0.25 & word1 != word2
)

# Inspect top 50 most similar pairs
similarity_df |> arrange(desc(similarity)) |> head(50)
Similarity Threshold

Setting the threshold above 0.5 will cause the graph to split into disconnected sub-graphs, losing the global structure that makes the maps interpretable. A threshold between 0.2 and 0.35 works well for COOEE with 1,000 words.
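
One way to check where your own threshold starts to fragment the graph is to count connected components at several cut-offs. A sketch with an invented toy edge list standing in for similarity_df:

```r
library(igraph)

# Toy edge list standing in for similarity_df (values invented)
sim_df <- data.frame(
  word1      = c("ship", "boat", "ship", "mother", "father", "boat"),
  word2      = c("boat", "sail", "sail", "father", "sister", "mother"),
  similarity = c(0.60,   0.55,   0.45,   0.52,     0.40,     0.30)
)

# Count connected components as the threshold rises
for (th in c(0.25, 0.35, 0.50)) {
  g_th <- graph_from_data_frame(subset(sim_df, similarity > th),
                                directed = FALSE)
  cat(sprintf("threshold %.2f: %d component(s)\n", th, components(g_th)$no))
}
# The single weak bridge (boat-mother, 0.30) holds the toy graph together
# at 0.25; above that, the graph splits into two components.
```

Running the same loop on the real similarity_df shows at which cut-off the COOEE graph breaks apart.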

Building the igraph object

We now have everything we need to build an igraph object. We also add a label attribute (for compatibility with Gephi and Graphia) and a weight attribute (for layout algorithms that use it):

Code
g  <- graph_from_data_frame(similarity_df, directed = FALSE)
g2 <- g  # keep a copy of the original before we modify g

# Add label attribute (igraph default node name is "name")
V(g)$label  <- V(g)$name

# Add weight attribute (expected by Gephi, Graphia, and some igraph layouts)
E(g)$weight <- E(g)$similarity

# Sanity check
head(V(g))
head(E(g))

Exporting the graph

Exporting to GraphML/GML allows you to import the graph into Gephi or Graphia for further exploration:

Code
write_graph(
  g,
  here::here("tutorials/conceptualmaps_showcase/data/COOEE_w10_simgt0.25.gml"),
  format = "gml"
)

Method ONE: t-SNE Overview

Section Overview

What you will learn: How to apply t-SNE dimensionality reduction to the word2vec matrix; how to create an interactive plotly version; and what t-SNE reveals well (local cluster structure) and what it distorts (global distances)

The t-SNE algorithm (t-distributed Stochastic Neighbour Embedding) maps the high-dimensional word vectors to two dimensions while trying to preserve local neighbourhood structure. Because the mapping is non-linear, it can capture structure that a single linear PCA projection would miss.

A quick overview plot using the wordVectors built-in:

Code
plot(training)

For a more flexible and readable version, we apply Rtsne directly and label the points with ggplot2:

Code
termsize <- 1000  # number of terms to include

mytsne <- Rtsne(training[1:termsize, ])

tsne_plot <- mytsne$Y |>
  as.data.frame() |>
  mutate(word = rownames(training)[1:termsize]) |>
  ggplot(aes(x = V1, y = V2, label = word)) +
  geom_text(size = 2) +
  labs(title = "t-SNE projection of COOEE word2vec (top 1,000 words)",
       x = "t-SNE 1", y = "t-SNE 2") +
  theme_minimal()

plot(tsne_plot)

The static plot is dense. For better exploration, use the interactive plotly version where you can zoom and hover:

Code
# Build plotly directly — avoids the ggplotly conversion error
tsne_df <- mytsne$Y |>
  as.data.frame() |>
  mutate(word = rownames(training)[1:termsize])

plot_ly(
  data   = tsne_df,
  x      = ~V1,
  y      = ~V2,
  text   = ~word,
  type   = "scatter",
  mode   = "text",
  textfont = list(size = 9)
) |>
  layout(
    title  = "t-SNE projection of COOEE word2vec (top 1,000 words)",
    xaxis  = list(title = "t-SNE 1"),
    yaxis  = list(title = "t-SNE 2")
  )
Code
# Save as standalone HTML
tsne_interactive <- plot_ly(
  data     = tsne_df,
  x        = ~V1,
  y        = ~V2,
  text     = ~word,
  type     = "scatter",
  mode     = "text",
  textfont = list(size = 9)
) |>
  layout(
    title = "t-SNE projection of COOEE word2vec (top 1,000 words)",
    xaxis = list(title = "t-SNE 1"),
    yaxis = list(title = "t-SNE 2")
  )

saveWidget(
  widget = tsne_interactive,
  file   = here::here("tutorials/conceptualmaps_showcase/data/tsne_cooee.html")
)
Interpreting the t-SNE Map

The t-SNE graph reveals many semantically tight clusters: officers and officer, mile and miles, husband/wife/married, weather/warm/hot/wind. Thematic clusters include law (justice, judgement, jurisdiction, case, shall, duties), early settlement (periodone, king, settled, prisoner, charged, murder), and daily life (bread, tea, drink, hut, fire, house, garden, school, church).

Notably, natives and blacks overlap in the t-SNE space, indicating that these words were used as near-synonyms in the corpus — a finding with significant historical implications.

Limitation: t-SNE excels at preserving local cluster structure but distorts global distances. The positions of periodone, periodtwo, etc. relative to each other in this map are not reliable indicators of their semantic relationship.


Method TWO: igraph with Fruchterman-Reingold

Section Overview

What you will learn: How to apply the Fruchterman-Reingold force-directed layout to the similarity graph; the effect of edge weight rescaling on the layout; and how to export publication-quality PDFs

The Fruchterman-Reingold algorithm is a force-directed layout that treats edges as springs and nodes as repelling charges. Strongly similar words (high-weight edges) are pulled together; all words push each other apart. This gives a physically intuitive layout.

Basic layout

Code
set.seed(1)
plot.igraph(
  g,
  vertex.size      = 0,
  vertex.label.cex = 0.5,
  weights          = E(g)$similarity,
  edge.width       = E(g)$similarity / 5,
  main             = "Word Similarity Network — Fruchterman-Reingold"
)

With explicit weight parameter

Passing weights explicitly to layout_with_fr() ensures the edge weights actually influence the layout (this is not always the default):

Code
set.seed(1)
plot.igraph(
  g,
  layout           = layout_with_fr(g, weights = E(g)$weight),
  vertex.size      = 0,
  vertex.label.cex = 0.7,
  edge.width       = E(g)$similarity / 10,
  main             = "Word Similarity Network — FR with weights"
)

Interpreting the FR Map

The period labels now appear in distinct regions of the map. periodone is more central and surrounded by king, murder, and prisoner — reflecting the convict-dominated early settlement. periodtwo is characterised by family themes: mother, brother, sister, husband, child, and common names like John, Mary, and George. periodthree has months, weekdays, weather, and travel words — the Australians are exploring their new country. periodfour begins to show political vocabulary.

Rescaled weights

The raw cosine similarities (0–1) produce a narrow weight range. Rescaling to a wider range (1–100 or 1–10,000) increases the contrast between strong and weak similarities, often producing a cleaner layout:

Code
E(g)$w_scaled <- scales::rescale(E(g)$weight, to = c(1, 100))

set.seed(1)
plot.igraph(
  g,
  layout           = layout_with_fr(g, weights = E(g)$w_scaled, niter = 2000),
  vertex.size      = 0,
  vertex.label.cex = 0.7,
  edge.width       = E(g)$similarity / 5,
  main             = "Word Similarity Network — FR, weights rescaled to 1–100"
)


Method THREE: igraph with DRL

Section Overview

What you will learn: How to apply the DrL (Distributed Recursive Layout) algorithm, which is designed for large graphs; and how rescaling weights to a very wide range (1–10,000) affects DRL results

The DrL (Distributed Recursive Layout) algorithm (Martin et al. 2011) is designed for graphs with thousands or tens of thousands of nodes. It partitions the graph recursively and applies a force-directed algorithm at each level. It can handle larger graphs than Fruchterman-Reingold, but typically needs wider weight ranges to work well.

Code
E(g)$w_scaled_drl <- scales::rescale(E(g)$weight, to = c(1, 10000))

set.seed(1)
plot.igraph(
  g,
  layout           = layout_with_drl(g, weights = E(g)$w_scaled_drl),
  vertex.size      = 0,
  vertex.label.cex = 0.7,
  edge.width       = E(g)$similarity / 20,
  main             = "Word Similarity Network — DRL, weights rescaled to 1–10,000"
)

FR vs DRL

Fruchterman-Reingold tends to produce rounder, more balanced layouts. DRL tends to produce more elongated, clustered layouts that can reveal global separation between topic clusters more clearly. For COOEE at 1,000 words, both are viable — try both and compare.
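
Since both layouts live in igraph, they can be compared side by side on the same graph. A sketch using a small random graph as a stand-in for the g object built earlier (substitute g and its weights to compare on the real data):

```r
library(igraph)

# Small random graph standing in for the COOEE similarity graph
set.seed(1)
g_demo <- sample_gnp(60, 0.08)
E(g_demo)$weight <- runif(ecount(g_demo), 0.25, 1)

# Side by side: FR (rounder, balanced) vs DrL (more elongated clusters)
op <- par(mfrow = c(1, 2), mar = c(1, 1, 2, 1))
plot(g_demo,
     layout = layout_with_fr(g_demo, weights = E(g_demo)$weight),
     vertex.size = 3, vertex.label = NA, main = "Fruchterman-Reingold")
plot(g_demo,
     layout = layout_with_drl(g_demo),
     vertex.size = 3, vertex.label = NA, main = "DrL")
par(op)
```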


Method FOUR: ForceAtlas2

Section Overview

What you will learn: How to apply the ForceAtlas2 algorithm — the default layout in Gephi — in R; why we use the unmodified copy g2 rather than the modified g; and what ForceAtlas2 reveals about the global structure of the COOEE semantic space

ForceAtlas2 is the default layout algorithm in Gephi. It is well suited for semantic networks because it is designed to produce layouts where global structure (inter-cluster distances) is meaningful. The layout.forceatlas2() function in the ForceAtlas2 R package animates the layout as it evolves; use plotstep to control how often an intermediate plot is displayed.

Use the Unmodified Copy g2

We have added scaled weight attributes to g in earlier sections. These can interfere with ForceAtlas2. We therefore use g2, the unmodified copy saved before any attribute additions.

Code
set.seed(1)
fa2_layout <- layout.forceatlas2(
  g2,
  iterations = 4000,
  plotstep   = 1000,   # show a plot every 1000 iterations
  directed   = FALSE
)

Code
set.seed(1)
plot.igraph(
  g2,
  layout           = fa2_layout,
  vertex.size      = 0,
  vertex.label.cex = 0.6,
  edge.width       = E(g2)$similarity / 10,
  main             = "Word Similarity Network — ForceAtlas2"
)

Interpreting the ForceAtlas2 Map

ForceAtlas2 reveals clear global trends: periodone clusters near king, ship, and prisoner. periodtwo is close to home, school, death, and married — the new Australians are coping with their new home and writing anxiously about their family. periodthree features expeditions in the new wilderness: journey, months, and logbook-like vocabulary. periodfour shows increasing political awareness: constitution, law, federal, and matters become prominent.

ForceAtlas2 is particularly good at showing this kind of global temporal structure — arguably better than FR or DRL for this corpus.

Running Outside RStudio

ForceAtlas2 is designed to show the graph constantly updating as it takes shape. Running it outside a code block (directly in the R console) displays a sequence of plots that is much more informative than a single static output. You can then zoom the final plot in the Plots tab and export to PDF from there.


Method FIVE: UMAP

Section Overview

What you will learn: How to apply UMAP (Uniform Manifold Approximation and Projection) to the word2vec matrix; how the n_neighbors parameter controls the balance between local and global structure; and why UMAP excels at local detail but cannot reliably map global distances

UMAP (McInnes, Healy, and Melville 2018) is a non-linear dimensionality reduction method that has become very popular as a faster and often more flexible alternative to t-SNE. The key parameter is n_neighbors: smaller values preserve fine-grained local structure; larger values preserve more of the global topology.

Code
set.seed(1)
umap_result <- umap(
  sub_model,
  n_neighbors  = 500,   # large value → more global structure
  min_dist     = 0.2,   # cluster tightness
  n_components = 2,
  metric       = "euclidean"
)

plot(
  umap_result[, 1],
  umap_result[, 2],
  pch  = 1,
  col  = "white",
  xlab = "UMAP 1",
  ylab = "UMAP 2",
  main = "UMAP projection of COOEE word2vec (top 1,000 words)"
)
text(
  umap_result[, 1],
  umap_result[, 2],
  labels = rownames(sub_model),
  cex    = 0.7
)

Interpreting the UMAP Map

UMAP is very accurate in local detail: person names, months, numbers, and other semantically tight groups cluster together correctly. However, the placement of the period labels (periodone, periodtwo, etc.) relative to each other looks almost arbitrary. This reflects a well-known property of UMAP: it is superior for local neighbourhood structure but cannot reliably represent global distances between clusters. For questions about the relative positions of major thematic groups, ForceAtlas2 or Fruchterman-Reingold are more appropriate.
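
The n_neighbors trade-off can be seen directly by re-embedding the same matrix at several settings. A sketch with a random toy matrix standing in for sub_model (substitute as.matrix(sub_model) to run it on the real vectors; the parameter values are illustrative):

```r
library(uwot)

# Toy matrix standing in for sub_model: 200 "words" x 50 dimensions
set.seed(1)
toy_vecs <- matrix(rnorm(200 * 50), nrow = 200)

# Small n_neighbors -> finer local detail; large -> more global topology
op <- par(mfrow = c(1, 3), mar = c(2, 2, 2, 1))
for (nn in c(5, 50, 150)) {
  set.seed(1)
  emb <- umap(toy_vecs, n_neighbors = nn, min_dist = 0.2)
  plot(emb, pch = 20, cex = 0.3, xlab = "", ylab = "",
       main = sprintf("n_neighbors = %d", nn))
}
par(op)
```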


Method SIX: Graph from Textplot (GML Import)

Section Overview

What you will learn: How to import a pre-computed GML graph file into R; how to apply igraph and ForceAtlas2 layouts to an externally generated graph; and how the textplot tool differs from the word2vec approach used above

External Tool Required

This section uses a .gml file generated by the textplot command-line tool (McClure 2015, GitHub). The GML file for COOEE is available for download:

Code
download.file(
  url      = "https://ladal.edu.au/tutorials/conceptualmaps_showcase/data/ALL_byperiod_momarkup3_t400-s10.gml",
  destfile = "tutorials/conceptualmaps_showcase/data/ALL_byperiod_momarkup3_t400-s10.gml",
  mode     = "wb"
)

The file was generated with the following command (for reference only):

textplot generate --term_depth 400 --skim_depth 10 --bandwidth 30000 \
  ALL_byperiod_nomarkup.txt ALL_byperiod_momarkup3_t400-s10.gml

Loading the GML file

Code
gml_file <- here::here("tutorials/conceptualmaps_showcase/data/ALL_byperiod_momarkup3_t400-s10.gml")
g_tp <- read_graph(gml_file, format = "gml")

Visualising with Fruchterman-Reingold

Code
set.seed(1)
plot.igraph(
  g_tp,
  layout           = layout_with_fr(g_tp, weights = E(g_tp)$weight),
  vertex.size      = 0,
  vertex.label.cex = 0.8,
  edge.width       = E(g_tp)$similarity / 5,
  main             = "Textplot Graph — Fruchterman-Reingold"
)
Interpreting the Textplot FR Map

A legal cluster is visible at the top of the map, with court, defendant, and judge. periodone is near captain, boat, ship, and convicts. Periods 2, 3, and 4 are relatively close to each other, near family relations (sister, father, brother) and affection (love).

Visualising with ForceAtlas2

When loading a .gml file from textplot, node names are stored in the label attribute rather than igraph’s default name. We need to copy label to name before using ForceAtlas2:

Code
V(g_tp)$name <- V(g_tp)$label

set.seed(1)
fa2_layout_tp <- layout.forceatlas2(
  g_tp,
  iterations = 5000,
  plotstep   = 500,
  directed   = FALSE,
  gravity    = 0.8,
  k          = 10000,
  ks         = 5,
  delta      = 1
)

plot.igraph(
  g_tp,
  layout           = fa2_layout_tp,
  vertex.size      = 0,
  vertex.label.cex = 0.6,
  edge.width       = E(g_tp)$weight / 10,
  main             = "Textplot Graph — ForceAtlas2"
)

Comparing the Six Methods

Summary

We have now built conceptual maps of the COOEE corpus using six different methods. Here is a summary of their strengths and weaknesses for this type of task:

Method           Local detail   Global structure   Speed    Interactivity
t-SNE            ⭐⭐⭐            ⭐                  Medium   ✓ via plotly
igraph FR        ⭐⭐             ⭐⭐                 Fast     static
igraph DRL       ⭐⭐             ⭐⭐                 Fast     static
ForceAtlas2      ⭐⭐             ⭐⭐⭐                Slow     Animated
UMAP             ⭐⭐⭐            ⭐                  Fast     static
Textplot + FA2   ⭐⭐             ⭐⭐⭐                Slow     static

Key findings from comparing the methods:

  • t-SNE and UMAP excel at revealing tight local clusters (synonyms, near-synonyms, semantic categories) but their global layouts are not reliable — do not read meaning into the distances between major clusters.
  • Fruchterman-Reingold and DRL provide a reasonable balance between local and global structure. Rescaling the edge weights (to 1–100 or 1–10,000) has a substantial effect on the layout quality.
  • ForceAtlas2 produces the most interpretable global layout for this corpus, clearly separating the four historical periods and placing them near their most characteristic vocabulary.
  • Textplot + ForceAtlas2 produces very similar results to the word2vec + ForceAtlas2 approach, suggesting that the layout algorithm matters more than the specific edge-weighting method, at least for this corpus.

There is no single best method. The choice depends on the research question: use t-SNE or UMAP to explore fine-grained semantic categories; use ForceAtlas2 or Fruchterman-Reingold to understand global thematic organisation.


Final Comments

As Tangherlini and Leonard (2013) argue in the context of topic modelling, computational methods offer a division of labour: the algorithm handles counting and similarity computation, while the researcher applies domain expertise to interpret the output. Conceptual maps are a particularly powerful illustration of this: they make the latent structure of a large corpus visible at a glance, but the interpretation of what the clusters mean — and what the distances between them imply — always requires human judgement.

The comparison of methods presented here also reinforces a broader methodological lesson: the same underlying data can look very different depending on how it is projected into two dimensions. Before drawing conclusions from any conceptual map, it is worth asking: does this layout algorithm preserve local structure, global structure, or both? Is the placement of nodes determined by the data, or partly by the algorithm’s own biases?


Citation & Session Info

Schneider, Gerold. 2026. Comparing Methods for Conceptual Maps. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/conceptualmaps_showcase/conceptualmaps_showcase.html (Version 2026.05.01).

@manual{schneider2026conceptualmaps_showcase,
  author       = {Schneider, Gerold},
  title        = {Comparing Methods for Conceptual Maps},
  note         = {tutorials/conceptualmaps_showcase/conceptualmaps_showcase.html},
  year         = {2026},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address      = {Brisbane},
  edition      = {2026.05.01}
}
AI Transparency Statement

This tutorial was adapted for LADAL by Martin Schweinberger with the assistance of Claude (claude.ai), a large language model created by Anthropic. The original tutorial was authored by Gerold Schneider (2026). The adaptation involved converting the document to Quarto format; fixing the YAML; replacing all hardcoded absolute paths with portable relative paths; removing all PDF-iframe embed patterns and replacing them with inline R plot output; adding LADAL-style section overviews, learning objectives, a prerequisite callout; consolidating duplicate UMAP plot blocks; adding set.seed(1) to the UMAP block for reproducibility; and adding the GML download block so the textplot section can be run without access to external tools. All scientific content, interpretation, and code logic are the work of the original author.

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ForceAtlas2_0.1   wordVectors_2.0   uwot_0.2.4        Matrix_1.7-2     
 [5] scales_1.4.0      htmlwidgets_1.6.4 plotly_4.12.0     Rtsne_0.17       
 [9] reshape2_1.4.5    ggrepel_0.9.8     tidytext_0.4.3    lubridate_1.9.4  
[13] forcats_1.0.0     stringr_1.6.0     dplyr_1.2.0       purrr_1.2.1      
[17] readr_2.1.5       tidyr_1.3.2       tibble_3.3.1      ggplot2_4.0.2    
[21] tidyverse_2.0.0   igraph_2.2.2     

loaded via a namespace (and not attached):
 [1] janeaustenr_1.0.0   generics_0.1.4      renv_1.1.7         
 [4] stringi_1.8.7       lattice_0.22-6      hms_1.1.4          
 [7] digest_0.6.39       magrittr_2.0.4      evaluate_1.0.5     
[10] grid_4.4.2          timechange_0.3.0    RColorBrewer_1.1-3 
[13] fastmap_1.2.0       plyr_1.8.9          jsonlite_2.0.0     
[16] BiocManager_1.30.27 httr_1.4.7          viridisLite_0.4.2  
[19] lazyeval_0.2.2      cli_3.6.5           rlang_1.1.7        
[22] tokenizers_0.3.0    withr_3.0.2         yaml_2.3.10        
[25] tools_4.4.2         tzdb_0.5.0          vctrs_0.7.2        
[28] R6_2.6.1            lifecycle_1.0.5     pkgconfig_2.0.3    
[31] pillar_1.11.1       gtable_0.3.6        data.table_1.17.0  
[34] glue_1.8.0          Rcpp_1.1.1          xfun_0.56          
[37] tidyselect_1.2.1    rstudioapi_0.17.1   knitr_1.51         
[40] farver_2.1.2        htmltools_0.5.9     SnowballC_0.7.1    
[43] rmarkdown_2.30      compiler_4.4.2      S7_0.2.1           



References

Martin, Shawn, W Michael Brown, Richard Klavans, and Kevin W Boyack. 2011. “OpenOrd: An Open-Source Toolbox for Large Graph Layout.” In Visualization and Data Analysis 2011, 7868:45–55. SPIE.
McInnes, Leland, John Healy, and James Melville. 2018. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv Preprint arXiv:1802.03426.
Tangherlini, Timothy R, and Peter Leonard. 2013. “Trawling in the Sea of the Great Unread: Sub-Corpus Topic Modeling and Humanities Research.” Poetics 41 (6): 725–49. https://doi.org/10.1016/j.poetic.2013.08.002.