Resources

A curated guide to tools, corpora, courses, and communities for language technology and text analysis

This curated collection brings together the best resources for language technology, text analytics, corpus linguistics, natural language processing, and computational methods in the humanities and social sciences. Whether you’re a complete beginner or an experienced researcher, you’ll find tools, tutorials, datasets, and communities to support your work.


LDaCA: LADAL’s Home Platform

LADAL is part of the Language Data Commons of Australia (LDaCA), a national research infrastructure providing access to language data and text analysis tools for Australian researchers. LDaCA emerged from the merger of the Australian Text Analytics Platform (ATAP) and PARADISEC, bringing together text analytics infrastructure and endangered language archives.

🗂️ Data Discovery

Search and access diverse language datasets including Australian English, Indigenous languages, migrant languages, oral history collections, and social media data.

📓 Jupyter Notebooks

Browser-based interactive coding environment — no installation needed. Ready-to-use text analysis capabilities for researchers without strong coding backgrounds.

🎓 Training & Support

Workshops, tutorials, documentation, and community support for language data researchers nationally. Free for Australian researchers.

Visit ldaca.edu.au to explore data collections, create a free account, and access tools and training.


Concordancing and Corpus Tools

AntConc and AntLab Suite

Laurence Anthony · Waseda University

Free

AntConc is one of the most widely used free concordancing tools, cross-platform (Windows, Mac, Linux) and well suited to teaching and research.

Core features: KWIC concordance · Collocates · Word clusters (n-grams) · Keyword analysis · Dispersion plots

Other AntLab tools:

  • AntFileConverter — convert PDF/Word/Excel to plain text
  • AntPConc — parallel concordancer for translation studies
  • AntWordProfiler — vocabulary profiling and analysis
  • AntGram — n-gram and word frequency analysis
  • FireAnt — download and organise online texts
  • EncodeAnt, VariAnt, AntFileSplitter, AntMover
Concordancing · Collocations · Keywords · Beginner-friendly
Download AntConc →
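
To make the KWIC (keyword-in-context) idea concrete, here is a minimal sketch in plain Python. It is not AntConc's implementation: the `kwic` helper, its simple regex tokenisation, and the sample sentence are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, window=3):
    """Return (left context, node, right context) for each hit of `keyword`.

    Hypothetical helper: tokens are runs of word characters (with an
    optional apostrophe part), and matching is case-insensitive.
    """
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

sample = "The cat sat on the mat. The dog chased the cat across the garden."
for left, node, right in kwic(sample, "cat", window=2):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20}  [{node}]  {right}")
```

Dedicated concordancers add sorting, regex and wildcard search, and file handling on top of this basic idea.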

#LancsBox

Lancaster University

Free

#LancsBox is a next-generation corpus toolkit combining ease of use with advanced functionality and built-in visualisations.

  • GraphColl: visualise collocational networks
  • Whelk: distribution of a search term across corpus files
  • Wizard: guided analysis for beginners
  • Built-in statistical tests · Multi-language support
Network visualisation · Collocations · Regex
Visit #LancsBox →
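
A collocational network is built from node-collocate pairs and an association score. The sketch below counts collocates within a window and scores them with one common formulation of mutual information (MI). The function names, the window-size correction in the expected frequency, and the toy data are assumptions for illustration; individual tools may compute MI differently.

```python
import math
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words occurring within +/- `window` tokens of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

def mutual_information(tokens, node, collocate, window=2):
    """MI = log2(observed / expected).

    Assumed expected frequency: f(node) * f(collocate) * span / N,
    where span = 2 * window.
    """
    freq = Counter(tokens)
    observed = collocates(tokens, node, window)[collocate]
    if observed == 0:
        return float("-inf")
    expected = freq[node] * freq[collocate] * (2 * window) / len(tokens)
    return math.log2(observed / expected)

tokens = "strong tea and strong coffee but weak tea".split()
print(collocates(tokens, "strong", window=1))
print(mutual_information(tokens, "strong", "coffee", window=1))
```

Pairs whose score exceeds a chosen threshold can then be drawn as edges in a network graph.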

Sketch Engine

Lexical Computing

Free trial

Sketch Engine is a comprehensive commercial platform with 90+ pre-loaded corpora and web access.

  • Word Sketches — grammatical/collocational summaries at a glance
  • Corpus building, terminology extraction, parallel corpora
  • Team collaboration and sharing features
  • Best for multilingual research and large-scale projects
Multilingual · Word sketches · Terminology · Translation
Visit Sketch Engine →

WordSmith Tools

Mike Scott · Lexically.net

WordSmith Tools is the professional standard for corpus analysis, trusted by researchers since 1996. Windows only (runs via Wine on Mac).

  • Concord — concordancing with sophisticated search
  • KeyWords — statistical keyword extraction
  • WordList — frequency lists and statistics
  • Dispersion plots · Collocate analysis · Batch processing
Professional · Keywords · Statistics · Windows
Visit WordSmith →
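
Statistical keyword extraction compares a word's frequency in a target corpus with its frequency in a reference corpus. The sketch below uses the simplified two-term log-likelihood statistic that is common in corpus comparison. It is an illustration only, not the code of any particular tool; the function name and the example counts are assumptions.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-term log-likelihood (G2) for one word across two corpora.

    freq_* are the word's raw counts; size_* are corpus sizes in tokens.
    """
    total = freq_target + freq_ref
    combined_size = size_target + size_ref
    expected_target = size_target * total / combined_size
    expected_ref = size_ref * total / combined_size
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts: 20 hits in a 1,000-token target corpus
# against 10 hits in a 1,000-token reference corpus.
print(round(log_likelihood(20, 1000, 10, 1000), 2))
```

Scores above 3.84 correspond to p < 0.05 at one degree of freedom, although keyword studies often apply stricter cut-offs and report an effect size alongside the statistic.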

Online Concordancers

No installation needed — use these directly in your browser.

Free with registration

English-Corpora.org (formerly the BYU corpora)

COCA (1 billion words, 1990–2019), COHA (400M words, 1820s–2000s), NOW Corpus, and TV, Movie, and Wikipedia corpora. Supports filtering by genre and time period.

Free

Lextutor

Web-based concordancers, vocabulary profilers, and multiple corpora. Excellent for language learning and teaching.


Text Analysis and NLP Tools

Free

Voyant Tools

Zero installation — works in browser. Upload texts instantly for word clouds, trend graphs, network visualisations, and more. Perfect for digital humanities and teaching.

Browser-based · Visualisation · Beginner
Free

Orange Data Mining

Visual drag-and-drop tool for text analytics and machine learning, with no coding required. Topic modelling (LDA), sentiment analysis, document clustering, word clouds.

No-code · Topic modelling · ML
Free

GATE

Open-source platform for large-scale NLP pipelines: information extraction, named entity recognition, relation extraction, semantic annotation, language identification.

NLP pipelines · NER · Annotation
Free

spaCy

Industrial-strength Python NLP library. Tokenisation, POS tagging, NER, dependency parsing, word vectors. Supports 60+ languages, with pre-trained pipelines available for a subset of them. Fast and production-ready.

Python · 60+ languages · Production
Free

NLTK

Python's foundational NLP learning platform. Comprehensive tutorials, many datasets included, wide range of algorithms. Ideal for learning NLP from scratch.

Python · Educational · Beginner
Free

Stanford CoreNLP

State-of-the-art NLP suite: tokenisation, POS, NER, parsing, coreference resolution, sentiment. Accessible via online demo, command line, Java, Python (stanza), or R.

Multi-language access · Coreference · Parsing

Specialised Tools

Free

BookNLP

NLP pipeline designed specifically for books and long documents. Character name clustering, speaker identification, referential gender inference, event tagging. GPU and CPU models available.

Literary analysis · Character analysis · Python
Web demo free

CLAWS POS Tagger

Lancaster's world-leading POS tagger — 96–97% accuracy. Tagged the British National Corpus. Web demo and batch processing available. Multiple tagsets.

POS tagging · High accuracy
Academic free

USAS Semantic Tagger

UCREL Semantic Analysis System: automatic semantic tagging across 21 major discourse fields, multi-word expression recognition, multiple languages. See also Wmatrix for corpus comparison.

Semantic tagging · Discourse
Free

SMARTool

Corpus-based tool for English-speaking learners of Russian. Handles rich Russian morphology, 3,000 basic vocabulary items, frequency-based learning.

Russian · Language learning

Learning Resources and Courses

Applied Language Technology

University of Helsinki · Free · Self-paced

Free

Two courses designed for linguists and humanists, with Jupyter notebooks and hands-on exercises. No prior programming knowledge needed.

  • Working with Text in Python — Python basics, regular expressions, text processing
  • NLP for Linguists — NLP fundamentals, machine learning basics, deep learning for NLP
Start Learning →

Introduction to Cultural Analytics & Python

Melanie Walsh · Free · Online textbook

Free

Outstanding textbook written specifically for humanities and social science scholars. Clear explanations, engaging datasets, continuously updated.

  • Python basics · Text analysis and NLP · Social media analysis
  • Network analysis · Mapping · Web scraping · Data visualisation
Read the Textbook →

The Programming Historian

Peer-reviewed · Available in EN, ES, FR, PT

Free

Peer-reviewed, collaborative lessons for digital humanists in Python, R, JavaScript, and more. Covers data management, distant reading, network analysis, mapping, GIS, web scraping, and visualisation.

Browse Lessons →

R and Statistics

Free

R for Data Science

By Hadley Wickham & Garrett Grolemund. Modern data science workflow using the tidyverse ecosystem. The standard starting point for R learners.

Free

Text Mining with R

By Julia Silge & David Robinson. Tidy approach to text analysis with practical, reproducible examples throughout.

Free

Quanteda Tutorials

Official quanteda documentation with comprehensive corpus analysis guides — our recommended R framework for text analysis.

Free

Advanced R

By Hadley Wickham. Deep dive into R programming for experienced users wanting to understand the language fully.

Free

STHDA

Comprehensive R tutorials for statistical methods, data visualisation with ggplot2, and machine learning.

Free

Quick-R

Quick-reference R code snippets for data management, statistics, and visualisation by Rob Kabacoff.

Specialist Training Platforms

Free

GLAM Workbench

Tools and tutorials for working with data from galleries, libraries, archives, and museums in Australia and New Zealand. Jupyter notebooks — click and run, no installation needed.

Free

TAPoR 3

Curated directory of 1,500+ text analysis research tools with descriptions, reviews, and comparison features. Search by analysis type, platform, language, cost, or discipline.

Paid

Lancaster Summer Schools

Annual intensive corpus linguistics courses, beginner to advanced, with expert instruction and hands-on training at Lancaster University.


Research Centres and Labs

UCREL, Lancaster University

World-leading corpus linguistics research centre. Developers of CLAWS, USAS, and Wmatrix. Home of the British National Corpus and annual summer schools.

Corpus linguistics · NLP tools

VARIENG, University of Helsinki

Research Unit for Variation, Contacts and Change in English. Home of the Helsinki Corpus, Corpus of Early English Correspondence, and multiple parsed corpora.

Language change · Historical corpora

Sydney Corpus Lab

Promotes corpus linguistics across Australia — workshops, training, research collaboration, and community building for corpus linguists.

Australia · Community

Text Crunching Centre, University of Zurich

NLP expertise as a service: consulting, custom tool development, text processing pipelines, sentiment analysis, topic modelling, and named entity recognition.

NLP consulting · Custom pipelines

AcqVA Aurora Lab, UiT

Research in language acquisition, variation, and attrition. Offers methodological consultation, data collection facilities, and analysis support.

Acquisition · Psycholinguistics

Stanford Literary Lab

Computational literary studies using quantitative methods. Research publications and innovative approaches to large-scale literary analysis.

Digital humanities · Literary studies

Media Research Methods Lab, HBI

Computational social science, social media analysis, automated content analysis, and network analysis at Leibniz Institute for Media Research.

Social media · Computational social science

NaCTeM

UK's first publicly-funded text mining centre. Software tools, training materials, and services in literature-based discovery and biomedical text mining.

Text mining · Biomedical NLP

Corpora and Datasets

Major English Corpora

British National Corpus (BNC)

100M words · 1980s–90s · Written & spoken · POS tagged

Gold standard reference corpus for British English. BNC2014 (100M words, 2010s) available for modern comparisons.

COCA

1 billion words · 1990–2019 · Free with registration

Corpus of Contemporary American English. Balanced across spoken, fiction, magazines, newspapers, and academic genres.

International Corpus of English (ICE)

1M words per variety · 20+ national varieties

Comparable corpora across national varieties including GB, USA, Ireland, Canada, India, Hong Kong, and more. Common design across varieties, grammatically annotated.

Google Books Ngram Corpus

Trillions of words · 1500–2019 · Multiple languages

Phrase frequency over time across multiple languages. Excellent for diachronic studies. Dataset downloadable for offline analysis.

Historical Corpora

COHA

400M words · 1820s–2000s · Balanced by decade

Corpus of Historical American English. Fiction, magazines, newspapers, and non-fiction for tracking language change over two centuries.

Early English Books Online (EEBO)

1473–1700 · 25,000+ texts

Covers the beginnings of English printing. Critical for historical linguistics and early modern English research.

Corpus of English Dialogues (CED)

1560–1760 · Uppsala University & Lancaster University

Trial proceedings, drama, and didactic works representing real and simulated conversation in Early Modern English.

Specialised Corpora

MICASE

1.8M words · Academic spoken English

Michigan Corpus of Academic Spoken English. Lectures, seminars, and study groups. Searchable online.

VOICE

1M words · 50+ L1 backgrounds

Vienna-Oxford International Corpus of English. Face-to-face interaction in English as a Lingua Franca.

CHILDES

50+ languages · Longitudinal

Child Language Data Exchange System. Transcription standards and CLAN analysis tools for language acquisition research.

EFCAMDAT

70M words · 150+ countries

EF-Cambridge Open Language Database. Large-scale learner corpus for L2 English research.

Multilingual and Parallel Corpora

Universal Dependencies

100+ languages · Open source

Syntactically annotated treebanks with cross-linguistically consistent annotation. Regular releases.

Leipzig Corpora Collection

136 languages · Web-crawled

Freely downloadable web-crawled text corpora of various sizes in 136 languages.

OPUS

90+ languages · Free download

Open parallel corpus collection: movie subtitles (OpenSubtitles), Bible translations, EU documents, and more.

Europarl

21 languages · Sentence-aligned

European Parliament proceedings — large-scale, sentence-aligned parallel corpus for translation research and MT.


Blogs and Communities


Additional Tools

Visualisation

Free

ggplot2 (R)

Grammar of Graphics implementation for R. Publication-quality, highly customisable plots. The standard for R visualisation.

Free

Gephi

Interactive network visualisation for large graphs. Open source, widely used for social network and co-occurrence analysis.

Free

Plotly (R & Python)

Interactive, web-ready graphs in both R and Python. Excellent for sharing interactive visualisations online.

Free

Shiny (R)

Build interactive web apps from R — no web development skills needed. Great for sharing research tools.

Annotation Tools

Free

WebAnno

Web-based multi-user annotation with inter-annotator agreement metrics. Supports many annotation types.
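
Inter-annotator agreement is usually reported as a chance-corrected coefficient. As a minimal sketch, the function below computes Cohen's kappa for two annotators in plain Python. The helper name and the toy labels are assumptions for illustration, and no claim is made here about which coefficients a given annotation tool reports.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["NOUN", "NOUN", "VERB", "VERB"]
annotator_2 = ["NOUN", "VERB", "VERB", "VERB"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance. With more than two annotators, a coefficient such as Fleiss' kappa or Krippendorff's alpha is the usual choice.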

Free

INCEpTION

Semantic annotation with knowledge base integration and active learning recommendations. Open source.

Free

brat

Browser-based linguistic annotation for entities, relationships, and coreference chains.

Prodigy

Modern annotation tool with active learning for rapid, efficient data labelling. From the makers of spaCy.

Data Management

Free

Open Science Framework (OSF)

Project management, preregistration, DOI minting, and version control for open science workflows.

Free

Zenodo

Long-term data preservation with DOI minting. GitHub integration. Free storage for datasets and code.

Free

GitHub

Version control, collaboration, and open science. Host code, data, and project websites. Essential for reproducible research.

Machine Learning and Deep Learning

Free

Hugging Face

Transformers library, pre-trained models (BERT, GPT, RoBERTa, XLM), datasets, and a community model hub. The go-to platform for modern NLP.

Free

PyTorch

Research-friendly deep learning framework with dynamic computation graphs. Widely adopted in NLP research.

Free

TensorFlow / Keras

Google's deep learning framework with the Keras high-level API. Production-friendly and widely used for NLP applications.


Getting Started

2. Install first tools

3. Learn programming

4. Specialise and share

  • Choose your focus area
  • Join a community
  • Start a research project
  • Contribute back

Python Visualisation

Free

matplotlib

Foundation visualisation library for Python. Highly customisable, publication-quality, and the basis for many other libraries.

Free

seaborn

Statistical visualisation with beautiful defaults, built on matplotlib. Great for data exploration.

Free

Altair

Declarative, interactive visualisation for Python based on Vega-Lite. Clean grammar-of-graphics approach.

Free

Cytoscape

Network analysis and visualisation with biological and linguistic applications. Extensible plugin ecosystem.

Python Text Analysis Frameworks

Free

scikit-learn

Machine learning for Python. Text classification, clustering, and feature extraction with a consistent, well-documented API.

Free

Gensim

Topic modelling, word embeddings (Word2Vec, fastText), and document similarity at scale. Efficient with large corpora.

Free

TextBlob

Simple NLP in Python: sentiment analysis, part-of-speech tagging, noun phrase extraction. A great starting point for beginners.

R Text Analysis Frameworks

Free

quanteda

Comprehensive, fast, and well-documented text analysis for R. Our recommended choice for corpus analysis.

Free

tidytext

Tidy data principles applied to text mining. Integrates seamlessly with dplyr and ggplot2.

Free

tm (text mining)

Established R package for text mining. Document-term matrices, preprocessing, and broad tool compatibility.
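
A document-term matrix has one row per document and one column per vocabulary term, with each cell holding a count. The sketch below builds one in plain Python purely for illustration. In R, `DocumentTermMatrix()` in tm and `dfm()` in quanteda construct this structure for you. The helper name, the whitespace tokenisation, and the toy documents are assumptions.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row of term counts per document.

    Tokenisation is a simple lowercase whitespace split.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["the cat sat", "the dog sat down"]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Real corpora produce very sparse matrices, which is why dedicated packages store them in sparse formats instead of nested lists.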

Documentation and Publishing

Free

Jupyter Book

Create beautiful, interactive online books directly from Jupyter notebooks. Ideal for sharing reproducible research.

Free

Read the Docs

Documentation hosting with Sphinx integration and version support. Standard platform for open-source project docs.

Free

Dataverse

Open-source data repository network used by universities and research institutions. Supports data citation and archiving.

Free

GitLab

Alternative to GitHub with private repositories, built-in CI/CD, and self-hosting options.


Awards and Showcases

Digital Humanities Awards

Annual community-voted recognition of excellent DH work across categories including Best Tool, Best Dataset, Best Visualisation, Best Training Materials, and Best Use of DH for Fun. A great way to discover cutting-edge projects.

Humanities Commons

Academic social network for sharing research, discovering projects, and collaborating across the humanities. Open access, community-governed.

King's Digital Lab

Software development, digital humanities projects, research infrastructure, and training at King's College London.

Nebraska Literary Lab

Digital humanities research, text analysis projects, and teaching resources at the University of Nebraska.


Stay Updated

Conferences

Corpus Linguistics: ICAME (International Computer Archive of Modern and Medieval English) · Corpus Linguistics (biennial) · CLUK (Corpus Linguistics in the UK)

Digital Humanities: DH (Digital Humanities conference) · TEI (Text Encoding Initiative) · ADHO (Alliance of Digital Humanities Organizations) · DHd Blog (German DH community, multilingual posts)

NLP/Computational Linguistics: ACL · NAACL · EMNLP · COLING

Browse upcoming events at linguistic-conferences.org and dh-abstracts.library.cmu.edu.


Contributing a Tutorial

Want to add a tutorial to LADAL? We welcome contributions from researchers and practitioners at all career stages.

Download templates: Tutorial template (.Rmd) · Bibliography file (.bib)

Required elements: Clear learning objectives · Setup instructions · Worked examples with code · Exercises with solutions · sessionInfo() for reproducibility

Submission: Develop your tutorial, test all code, then email your .Rmd file and any supporting data files to ladal@uq.edu.au. We’ll review, suggest any revisions, and publish with full attribution.

Code style: Function and package names in backticks · Consistent indentation · Comments on complex operations · Use library() not require()

Tables: Use flextable · Clear captions · Appropriate width settings

Exercises: Use the standard exercise block with collapsible <details> answer sections so readers can attempt the problem before revealing the solution.

Citation format for your tutorial:

YourLastName, YourFirstName. 2026. *The Title of Your Tutorial*.
Your Institution. url: https://ladal.edu.au/tutorials/.../....html (Version 2026.MM.DD).

BibTeX entry:

@manual{yourlastname2026topic,
  author = {YourLastName, YourFirstName},
  title = {The Title of Your Tutorial},
  note = {https://ladal.edu.au/tutorials/.../....html},
  year = {2026},
  organization = {Your Affiliation},
  address = {Your Location},
  edition = {2026.MM.DD}
}

Know a resource we've missed?

This page is a living document. If you have a tool, corpus, course, or community to suggest, we'd love to hear from you.

References

Anthony, L. (2004). AntConc: A learner and classroom friendly, multi-platform corpus analysis toolkit. IWLeL 2004, 7–13.

Egbert, J., Larsson, T., & Biber, D. (2020). Doing linguistics with a corpus: Methodological considerations for the everyday user. Cambridge University Press.

McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge University Press.

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media.

Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press.

Walsh, M. (2021). Introduction to Cultural Analytics & Python. https://melaniewalsh.github.io/Intro-Cultural-Analytics/

Wickham, H., & Grolemund, G. (2016). R for data science. O’Reilly Media.


Last updated: 2026-02-08 · Maintained by: The LADAL Team · Contribute: Contact us
