Resources
A curated guide to tools, corpora, courses, and communities for language technology and text analysis

This collection brings together resources for language technology, text analytics, corpus linguistics, natural language processing, and computational methods in the humanities and social sciences. Whether you’re a complete beginner or an experienced researcher, you’ll find tools, tutorials, datasets, and communities to support your work.
LDaCA: LADAL’s Home Platform
LADAL is part of the Language Data Commons of Australia (LDaCA), a national research infrastructure providing access to language data and text analysis tools for Australian researchers. LDaCA emerged from the merger of the Australian Text Analytics Platform (ATAP) and PARADISEC, bringing together text analytics infrastructure and endangered language archives.
🗂️ Data Discovery
Search and access diverse language datasets including Australian English, Indigenous languages, migrant languages, oral history collections, and social media data.
📓 Jupyter Notebooks
Browser-based interactive coding environment — no installation needed. Ready-to-use text analysis capabilities for researchers without strong coding backgrounds.
🎓 Training & Support
Workshops, tutorials, documentation, and community support for language data researchers nationally. Free for Australian researchers.
Visit ldaca.edu.au to explore data collections, create a free account, and access tools and training.
Concordancing and Corpus Tools
AntConc and AntLab Suite
Laurence Anthony · Waseda University
AntConc is the most widely used free concordancing tool worldwide. It is cross-platform (Windows, Mac, Linux) and well suited to teaching and research.
Core features: KWIC concordance · Collocates · Word clusters (n-grams) · Keyword analysis · Dispersion plots
Other AntLab tools:
- AntFileConverter — convert PDF/Word/Excel to plain text
- AntPConc — parallel concordancer for translation studies
- AntWordProfiler — vocabulary profiling and analysis
- AntGram — n-gram and word frequency analysis
- FireAnt — download and organise online texts
- EncodeAnt, VariAnt, AntFileSplitter, AntMover
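For readers new to concordancing, the KWIC (keyword in context) display listed among AntConc's core features can be approximated in a few lines of plain Python. This is a minimal sketch, not AntConc's implementation: the function name `kwic`, the `width` parameter, and the sample sentence are all illustrative assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per match of `keyword` (hypothetical helper)."""
    lines = []
    # Whole-word, case-insensitive match; re.escape guards against regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Invented sample text, for illustration only.
sample = "The cat sat on the mat. A cat is not a dog, but the cat does not mind."
for line in kwic(sample, "cat", width=15):
    print(line)
```

Dedicated tools add sorting by left or right context, regular-expression queries, and multi-file corpora on top of this basic idea.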
#LancsBox
Lancaster University
#LancsBox is a next-generation corpus toolkit combining ease of use with advanced functionality and beautiful visualisations.
- GraphColl — visualise collocational networks
- Whelk — powerful regex search
- Wizard — guided analysis for beginners
- Built-in statistical tests · Multi-language support
Sketch Engine
Lexical Computing
Sketch Engine is a comprehensive commercial platform with 90+ pre-loaded corpora and web access.
- Word Sketches — grammatical/collocational summaries at a glance
- Corpus building, terminology extraction, parallel corpora
- Team collaboration and sharing features
- Best for multilingual research and large-scale projects
WordSmith Tools
Mike Scott · Lexically.net
WordSmith Tools is the professional standard for corpus analysis, trusted by researchers since 1996. Windows only (runs via Wine on Mac).
- Concord — concordancing with sophisticated search
- KeyWords — statistical keyword extraction
- WordList — frequency lists and statistics
- Dispersion plots · Collocate analysis · Batch processing
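To make statistical keyword extraction concrete, the sketch below scores words with the log-likelihood (G2) statistic, one measure commonly used for keyness in corpus linguistics. It is an illustration under stated assumptions rather than a description of how WordSmith computes its KeyWords lists: the function names, the toy token lists, and the choice to keep only words that are relatively more frequent in the target corpus are all assumptions made here.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 for a word occurring `a` times in a target corpus of `c` tokens
    and `b` times in a reference corpus of `d` tokens."""
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    # By convention a zero count contributes nothing (0 * ln 0 is treated as 0).
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the target corpus (hypothetical helper)."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(t.values()), sum(r.values())
    scores = {}
    for word in t:
        a, b = t[word], r[word]  # Counter returns 0 for unseen words
        if a / c > b / d:        # keep positive keywords only
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Invented toy corpora, for illustration only.
target = "the corpus shows the corpus data".split()
reference = "the cat sat on the mat".split()
print(keywords(target, reference))
```

With real corpora the token lists would come from a tokeniser, and scores are usually read against a significance threshold rather than taken from a fixed top-N cut.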
Online Concordancers
No installation needed — use these directly in your browser.
BYU Corpora Family (now English-Corpora.org)
COCA (1 billion words, 1990–present), COHA (400M words, 1820s–present), NOW Corpus, TV/Movie/Wikipedia corpora. Genre and time filtering.
Lextutor
Web-based concordancers, vocabulary profilers, and multiple corpora. Excellent for language learning and teaching.
Text Analysis and NLP Tools
Voyant Tools
Zero installation — works in browser. Upload texts instantly for word clouds, trend graphs, network visualisations, and more. Perfect for digital humanities and teaching.
Orange Data Mining
Visual drag-and-drop tool for text analytics and machine learning — no coding required. Topic modeling (LDA), sentiment analysis, document clustering, word clouds.
GATE
Open-source platform for large-scale NLP pipelines: information extraction, named entity recognition, relation extraction, semantic annotation, language identification.
spaCy
Industrial-strength Python NLP library. Tokenisation, POS tagging, NER, dependency parsing, word vectors. Supports 60+ languages, with pre-trained pipelines for a subset of them. Fast and production-ready.
NLTK
Python's foundational NLP learning platform. Comprehensive tutorials, many datasets included, wide range of algorithms. Ideal for learning NLP from scratch.
Stanford CoreNLP
State-of-the-art NLP suite: tokenisation, POS, NER, parsing, coreference resolution, sentiment. Accessible via online demo, command line, Java, Python (stanza), or R.
Specialised Tools
BookNLP
NLP pipeline designed specifically for books and long documents. Character name clustering, speaker identification, referential gender inference, event tagging. GPU and CPU models available.
CLAWS POS Tagger
Lancaster's world-leading POS tagger — 96–97% accuracy. Tagged the British National Corpus. Web demo and batch processing available. Multiple tagsets.
USAS Semantic Tagger
UCREL Semantic Analysis System: automatic semantic tagging across 21 major discourse fields, multi-word expression recognition, multiple languages. See also Wmatrix for corpus comparison.
SMARTool
Corpus-based tool for English-speaking learners of Russian. Handles rich Russian morphology, 3,000 basic vocabulary items, frequency-based learning.
Learning Resources and Courses
Applied Language Technology
University of Helsinki · Free · Self-paced
Two courses designed for linguists and humanists, with Jupyter notebooks and hands-on exercises. No prior programming knowledge needed.
- Working with Text in Python — Python basics, regular expressions, text processing
- NLP for Linguists — NLP fundamentals, machine learning basics, deep learning for NLP
Introduction to Cultural Analytics & Python
Melanie Walsh · Free · Online textbook
Outstanding textbook written specifically for humanities and social science scholars. Clear explanations, engaging datasets, continuously updated.
- Python basics · Text analysis and NLP · Social media analysis
- Network analysis · Mapping · Web scraping · Data visualisation
The Programming Historian
Peer-reviewed · Available in EN, ES, FR, PT
Peer-reviewed, collaborative lessons for digital humanists in Python, R, JavaScript, and more. Covers data management, distant reading, network analysis, mapping, GIS, web scraping, and visualisation.
R and Statistics
R for Data Science
By Hadley Wickham & Garrett Grolemund. Modern data science workflow using the tidyverse ecosystem. The standard starting point for R learners.
Text Mining with R
By Julia Silge & David Robinson. Tidy approach to text analysis with practical, reproducible examples throughout.
Quanteda Tutorials
Official quanteda documentation with comprehensive corpus analysis guides — our recommended R framework for text analysis.
Advanced R
By Hadley Wickham. Deep dive into R programming for experienced users wanting to understand the language fully.
STHDA
Comprehensive R tutorials for statistical methods, data visualisation with ggplot2, and machine learning.
Quick-R
Quick-reference R code snippets for data management, statistics, and visualisation by Rob Kabacoff.
Specialist Training Platforms
GLAM Workbench
Tools and tutorials for working with data from galleries, libraries, archives, and museums in Australia and New Zealand. Jupyter notebooks — click and run, no installation needed.
TAPoR 3
Curated directory of 1,500+ text analysis research tools with descriptions, reviews, and comparison features. Search by analysis type, platform, language, cost, or discipline.
Lancaster Summer Schools
Annual intensive corpus linguistics courses, beginner to advanced, with expert instruction and hands-on training at Lancaster University.
Research Centres and Labs
UCREL, Lancaster University
World-leading corpus linguistics research centre. Developers of CLAWS, USAS, and Wmatrix. Home of the British National Corpus and annual summer schools.
VARIENG, University of Helsinki
Research Unit for Variation, Contacts and Change in English. Home of the Helsinki Corpus, Corpus of Early English Correspondence, and multiple parsed corpora.
Sydney Corpus Lab
Promotes corpus linguistics across Australia — workshops, training, research collaboration, and community building for corpus linguists.
Text Crunching Centre, University of Zurich
NLP expertise as a service: consulting, custom tool development, text processing pipelines, sentiment analysis, topic modelling, and named entity recognition.
AcqVA Aurora Lab, UiT
Research in language acquisition, variation, and attrition. Offers methodological consultation, data collection facilities, and analysis support.
Stanford Literary Lab
Computational literary studies using quantitative methods. Research publications and innovative approaches to large-scale literary analysis.
Media Research Methods Lab, HBI
Computational social science, social media analysis, automated content analysis, and network analysis at Leibniz Institute for Media Research.
NaCTeM
UK's first publicly funded text mining centre. Software tools, training materials, and services in literature-based discovery and biomedical text mining.
Corpora and Datasets
Major English Corpora
British National Corpus (BNC)
Gold standard reference corpus for British English. BNC2014 (100M words, 2010s) available for modern comparisons.
COCA
Corpus of Contemporary American English. Balanced across spoken, fiction, magazines, newspapers, and academic genres.
International Corpus of English (ICE)
Comparable corpora across national varieties including GB, USA, Ireland, Canada, India, Hong Kong, and more. Shared design and structure, grammatically annotated.
Google Books Ngram Corpus
Phrase frequency over time across multiple languages. Excellent for diachronic studies. Dataset downloadable for offline analysis.
Historical Corpora
COHA
Corpus of Historical American English. Fiction, magazines, newspapers, and non-fiction for tracking language change over two centuries.
Early English Books Online (EEBO)
Covers the beginnings of English printing. Critical for historical linguistics and early modern English research.
Corpus of English Dialogues (CED)
Trial proceedings, drama, and didactic works representing real and simulated conversation in Early Modern English.
Specialised Corpora
MICASE
Michigan Corpus of Academic Spoken English. Lectures, seminars, and study groups. Searchable online.
VOICE
Vienna-Oxford International Corpus of English. Face-to-face interaction in English as a Lingua Franca.
CHILDES
Child Language Data Exchange System. Transcription standards and CLAN analysis tools for language acquisition research.
EFCAMDAT
EF-Cambridge Open Language Database. Large-scale learner corpus for L2 English research.
Multilingual and Parallel Corpora
Universal Dependencies
Syntactically annotated treebanks with cross-linguistically consistent annotation. Regular releases.
Leipzig Corpora Collection
Freely downloadable web-crawled text corpora of various sizes in 136 languages.
OPUS
Open parallel corpus collection: movie subtitles (OpenSubtitles), Bible translations, EU documents, and more.
Europarl
European Parliament proceedings — large-scale, sentence-aligned parallel corpus for translation research and MT.
Blogs and Communities
Guillaume Desagulier's corpus linguistics notebook — usage-based methods, R for corpus analysis, cognitive linguistics, construction grammar.
Companion to Doing Linguistics with a Corpus by Egbert, Larsson & Biber. Corpus methodology, research design, statistical issues, methodological debates.
Long-running discussion list for the corpus linguistics community — tool announcements, conference info, job postings, corpus releases.
Data science, topic modelling, deep learning, and learning analytics with practical tutorials.
Social media data collection, workshops, tool updates, open office hours, and research methods from QUT's Digital Observatory.
Thoughtful commentary on statistics and data science from Jeff Leek, Roger Peng, and Rafa Irizarry.
News aggregator for the DH community — project highlights, announcements, and curated content.
Long-running digital humanities mailing list (since 1987). Thoughtful, sustained discussions on DH theory and practice.
Additional Tools
Visualisation
ggplot2 (R)
Grammar of Graphics implementation for R. Publication-quality, highly customisable plots. The standard for R visualisation.
Gephi
Interactive network visualisation for large graphs. Open source, widely used for social network and co-occurrence analysis.
Plotly (R & Python)
Interactive, web-ready graphs in both R and Python. Excellent for sharing interactive visualisations online.
Shiny (R)
Build interactive web apps from R — no web development skills needed. Great for sharing research tools.
Annotation Tools
WebAnno
Web-based multi-user annotation with inter-annotator agreement metrics. Supports many annotation types.
INCEpTION
Semantic annotation with knowledge base integration and active learning recommendations. Open source.
Prodigy
Modern annotation tool with active learning for rapid, efficient data labelling. From the makers of spaCy.
Data Management
Open Science Framework (OSF)
Project management, preregistration, DOI minting, and version control for open science workflows.
Zenodo
Long-term data preservation with DOI minting. GitHub integration. Free storage for datasets and code.
GitHub
Version control, collaboration, and open science. Host code, data, and project websites. Essential for reproducible research.
Machine Learning and Deep Learning
Hugging Face
Transformers library, pre-trained models (BERT, GPT, RoBERTa, XLM), datasets, and a community model hub. The go-to platform for modern NLP.
PyTorch
Research-friendly deep learning framework with dynamic computation graphs. Growing adoption in NLP research.
TensorFlow / Keras
Google's deep learning framework with the Keras high-level API. Production-friendly and widely used for NLP applications.
Getting Started
Specialise and share
- Choose your focus area
- Join a community
- Start a research project
- Contribute back
Python Visualisation
matplotlib
Foundation visualisation library for Python. Highly customisable, publication-quality, and the basis for many other libraries.
seaborn
Statistical visualisation with beautiful defaults, built on matplotlib. Great for data exploration.
Altair
Declarative, interactive visualisation for Python based on Vega-Lite. Clean grammar-of-graphics approach.
Cytoscape
Network analysis and visualisation with biological and linguistic applications. Extensible plugin ecosystem.
Python Text Analysis Frameworks
scikit-learn
Machine learning for Python. Text classification, clustering, and feature extraction with a consistent, well-documented API.
Gensim
Topic modelling, word embeddings (Word2Vec, fastText), and document similarity at scale. Efficient with large corpora.
TextBlob
Simple NLP in Python: sentiment analysis, part-of-speech tagging, noun-phrase extraction. A great starting point for beginners.
R Text Analysis Frameworks
quanteda
Comprehensive, fast, and well-documented text analysis for R. Our recommended choice for corpus analysis.
tidytext
Tidy data principles applied to text mining. Integrates seamlessly with dplyr and ggplot2.
tm (text mining)
Established R package for text mining. Document-term matrices, preprocessing, and broad tool compatibility.
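Document-term matrices underlie much of what the frameworks above do, so a small illustration may help. The sketch below builds one in plain Python and is not the `tm`, quanteda, or tidytext implementation. The function name, the simple lowercase regex tokeniser, and the two sample documents are assumptions chosen for brevity.

```python
import re
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    # Naive tokeniser (an assumption): lowercase runs of letters and apostrophes.
    tokenised = [re.findall(r"[a-z']+", doc.lower()) for doc in docs]
    # Sorted vocabulary gives a stable column order.
    vocab = sorted({tok for toks in tokenised for tok in toks})
    matrix = []
    for toks in tokenised:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Invented sample documents, for illustration only.
docs = ["The cat sat.", "The dog sat on the cat."]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Real packages store such matrices sparsely and add preprocessing steps such as stopword removal, stemming, and weighting, which this sketch omits.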
Documentation and Publishing
Jupyter Book
Create beautiful, interactive online books directly from Jupyter notebooks. Ideal for sharing reproducible research.
Read the Docs
Documentation hosting with Sphinx integration and version support. Standard platform for open-source project docs.
Dataverse
Open-source data repository network used by universities and research institutions. Supports data citation and archiving.
GitLab
Alternative to GitHub with private repositories, built-in CI/CD, and self-hosting options.
Awards and Showcases
Digital Humanities Awards
Annual community-voted recognition of excellent DH work across categories including Best Tool, Best Dataset, Best Visualisation, Best Training Materials, and Best Use of DH for Fun. A great way to discover cutting-edge projects.
Humanities Commons
Academic social network for sharing research, discovering projects, and collaborating across the humanities. Open access, community-governed.
King's Digital Lab
Software development, digital humanities projects, research infrastructure, and training at King's College London.
Nebraska Literary Lab
Digital humanities research, text analysis projects, and teaching resources at the University of Nebraska.
Stay Updated
Conferences
Corpus Linguistics: ICAME (International Computer Archive of Modern and Medieval English) · Corpus Linguistics (biennial) · CLUK (Corpus Linguistics in the UK)
Digital Humanities: DH (Digital Humanities conference) · TEI (Text Encoding Initiative) · ADHO (Alliance of Digital Humanities Organizations) · DHd Blog (German DH community, multilingual posts)
NLP/Computational Linguistics: ACL · NAACL · EMNLP · COLING
Browse upcoming events at linguistic-conferences.org and dh-abstracts.library.cmu.edu.
Contributing a Tutorial
Want to add a tutorial to LADAL? We welcome contributions from researchers and practitioners at all career stages.
Download templates: Tutorial template (.Rmd) · Bibliography file (.bib)
Required elements: Clear learning objectives · Setup instructions · Worked examples with code · Exercises with solutions · sessionInfo() for reproducibility
Submission: Develop your tutorial, test all code, then email your .Rmd file and any supporting data files to ladal@uq.edu.au. We’ll review, suggest any revisions, and publish with full attribution.
Code style: Function and package names in backticks · Consistent indentation · Comments on complex operations · Use library() not require()
Tables: Use flextable · Clear captions · Appropriate width settings
Exercises: Use the standard exercise block with collapsible <details> answer sections so readers can attempt the problem before revealing the solution.
Citation format for your tutorial:
YourLastName, YourFirstName. 2026. *The Title of Your Tutorial*.
Your Institution. url: https://ladal.edu.au/tutorials/.../....html (Version 2026.MM.DD).
BibTeX entry:
@manual{yourlastname2026topic,
author = {YourLastName, YourFirstName},
title = {The Title of Your Tutorial},
note = {https://ladal.edu.au/tutorials/.../....html},
year = {2026},
organization = {Your Affiliation},
address = {Your Location},
edition = {2026.MM.DD}
}
References
Egbert, J., Larsson, T., & Biber, D. (2020). Doing linguistics with a corpus: Methodological considerations for the everyday user. Cambridge University Press.
Anthony, L. (2004). AntConc: A learner and classroom friendly, multi-platform corpus analysis toolkit. IWLeL 2004, 7–13.
McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge University Press.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press.
Walsh, M. (2021). Introduction to Cultural Analytics & Python. https://melaniewalsh.github.io/Intro-Cultural-Analytics/
Wickham, H., & Grolemund, G. (2016). R for data science. O’Reilly Media.
Last updated: 2026-02-08 · Maintained by: The LADAL Team · Contribute: Contact us