RESOURCES

Essential Resources for Language Technology and Data Science
This curated collection brings together the best resources for language technology, text analytics, corpus linguistics, natural language processing, and computational methods in the humanities and social sciences. Whether you’re a complete beginner or an experienced researcher, you’ll find tools, tutorials, datasets, and communities to support your work.
The Language Technology and Data Analysis Laboratory (LADAL) is part of LDaCA (Language Data Commons of Australia), providing tutorials, tools, and training materials for language data science and text analytics. This resources page serves as a comprehensive guide to the broader ecosystem of language technology resources available to researchers worldwide.
- Browse by category using the table of contents
- Start with Tools if you need software immediately
- Explore Courses for structured learning
- Join Communities to connect with others
- Check Datasets for practice materials
- Bookmark this page as your go-to resource hub!
Platforms and Infrastructure
LDaCA: Language Data Commons of Australia

LADAL is part of the Language Data Commons of Australia (LDaCA), a national research infrastructure providing access to language data and text analysis tools for Australian researchers.
What LDaCA Offers:
- Data discovery: Search and access diverse language datasets
- Jupyter Notebooks: Interactive coding environment, no installation needed
- Analysis tools: Ready-to-use text analysis capabilities
- Data collections: Curated corpora from Australian sources
- Training and support: Workshops, tutorials, documentation
- Community: Connect with language data researchers nationally
- Standards: FAIR data principles and metadata standards
LDaCA’s Evolution:
LDaCA emerged from the merger of the Australian Text Analytics Platform (ATAP) and PARADISEC, bringing together text analytics infrastructure and endangered language archives to create a comprehensive language data ecosystem.
Key Features:
- Accessible to researchers without strong coding backgrounds
- Combines language corpora with analysis tools
- Includes Australian English, Indigenous languages, and migrant languages
- Supports diverse research communities
- Free for Australian researchers
LDaCA Data Collections:
- Australian language corpora
- Social media datasets
- Historical texts
- Indigenous language materials
- Oral history collections
- Migrant language resources
Analysis Capabilities:
- Text preprocessing pipelines
- Concordancing and corpus analysis
- Topic modeling
- Sentiment analysis
- Named entity recognition
- Network analysis for texts
Visit ldaca.edu.au to:
1. Explore available data collections
2. Create a free account
3. Access Jupyter notebooks and analysis tools
4. Search for language datasets
5. Book training sessions
6. Join the LDaCA community
Software Tools
Concordancing and Corpus Analysis
AntConc and AntLab Suite

AntConc, developed by Laurence Anthony, is one of the most widely used free concordancing tools worldwide.
Core Features:
- Concordance: KWIC displays with powerful search
- Collocates: Find words that appear together
- Word clusters: N-gram analysis
- Keyword analysis: Compare corpora
- Dispersion plots: Visualize word distribution
Why AntConc:
- ✅ Completely free
- ✅ Cross-platform (Windows, Mac, Linux)
- ✅ No installation hassles
- ✅ Perfect for teaching and learning
- ✅ Handles large corpora efficiently
- ✅ Extensive documentation and tutorials
Other Tools in AntLab:
Laurence Anthony’s AntLab offers an impressive suite of specialized tools:
- AntFileConverter: Convert PDF, Word, Excel to plain text
- AntFileSplitter: Divide large files into manageable chunks
- AntGram: N-gram and word frequency analysis
- AntPConc: Parallel concordancer for translation studies
- AntWordProfiler: Vocabulary profiling and analysis
- AntMover: Genre and move analysis
- EncodeAnt: Character encoding conversion
- VariAnt: Spelling variant analysis
- FireAnt: Collect and analyze social media data
Getting Started:
1. Download from laurenceanthony.net
2. Load plain text files or folders
3. Click “Concordance” tab
4. Enter search term
5. Explore your results!
Tutorial: See our Concordancing Tutorial for R-based concordancing.
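As a taste of the R-based approach covered there, here is a minimal concordancing sketch using the quanteda package and its built-in corpus of US inaugural addresses (the search term is just an example):
```{r}
# Minimal KWIC (keyword-in-context) concordance in R with quanteda
library(quanteda)

# tokenize the built-in corpus of US presidential inaugural addresses
toks <- tokens(data_corpus_inaugural)

# display the search term with five words of context on either side
kwic(toks, pattern = "liberty", window = 5)
```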
#LancsBox
#LancsBox is a next-generation corpus analysis toolkit from Lancaster University combining ease-of-use with advanced functionality.
Key Features:
- Modern, intuitive interface
- GraphColl: Visualize collocational networks
- Whelk: See how search results are distributed across corpus files
- Wizard: Guided analysis for beginners
- KWIC concordances with rich context
- Built-in statistical tests
- Multi-language support
Advantages:
- Free for academic use
- Gentler learning curve than AntConc for some users
- Beautiful visualizations
- Integrated help system
Sketch Engine
Sketch Engine is a comprehensive commercial platform with web access and extensive pre-loaded corpora.
Unique Features:
- Word Sketches: Grammatical/collocational summaries at a glance
- 90+ languages: Pre-loaded corpora
- Corpus building: Upload and process your texts
- Terminology extraction: Automatic keyword identification
- Parallel corpora: Translation studies support
- Collaborative: Team projects and sharing
Best For:
- Multilingual research
- Professional corpus linguistics
- Large-scale projects
- Teams needing shared resources
Pricing: Free trial, then subscription (academic discounts available)
WordSmith Tools
WordSmith Tools by Mike Scott is a long-established professional suite for corpus analysis, trusted by thousands of researchers.
Core Components:
- Concord: Concordancing with sophisticated search
- KeyWords: Statistical keyword extraction
- WordList: Frequency lists and statistics
- Dispersion plots and distribution analysis
- Detailed collocate analysis
- Batch processing for multiple files
Why WordSmith:
- Industry standard since 1996
- Extremely powerful
- Publication-quality statistics
- Comprehensive documentation
- Active user community
Note: Windows only (runs on Mac via Wine), reasonably priced license
Online Concordancers
No installation needed:
English-Corpora.org (formerly the BYU corpora) - english-corpora.org
- COCA (Contemporary American English): 1 billion words
- COHA (Historical American English): 400 million words, 1820s-present
- NOW Corpus: Web text, constantly updated
- TV Corpus, Movie Corpus, Wikipedia Corpus
- Free with registration, powerful search interface
Lextutor - lextutor.ca
- Free web-based concordancers
- Vocabulary profilers
- Excellent for language learning
- Multiple corpora available
Sketch Engine Web - sketchengine.eu
- Access without software installation
- Pre-loaded mega-corpora
- Free trial available
Text Analysis and NLP Tools
Voyant Tools
Voyant Tools is a web-based reading and analysis environment for digital texts.
Features:
- Zero installation—works in browser
- Upload texts instantly
- Multiple visualization types
- Word clouds, trend graphs, networks
- Collaborative analysis
- Embed visualizations in websites
Perfect For:
- Digital humanities
- Quick exploratory analysis
- Teaching text analysis
- Public-facing projects
- Students and beginners
Getting Started:
Visit voyant-tools.org, paste text or upload files, and start exploring!
Orange Data Mining
Orange is a visual programming tool for data mining, machine learning, and text analytics.
Text Add-on Features:
- Preprocessing pipelines
- Topic modeling (LDA)
- Sentiment analysis
- Document clustering
- Word clouds and visualizations
- Machine learning classification
Why Orange:
- Visual workflow design (drag and drop)
- No coding required
- Powerful machine learning
- Great for teaching
- Free and open source
GATE (General Architecture for Text Engineering)
GATE is a powerful open-source platform for text processing, especially NLP.
Capabilities:
- Information extraction
- Named entity recognition
- Relation extraction
- Opinion mining
- Semantic annotation
- Language identification
Best For:
- Large-scale text processing
- Custom NLP pipelines
- Research requiring annotation
- Multilingual projects
Stanford CoreNLP
Stanford CoreNLP provides a suite of NLP tools with state-of-the-art accuracy.
Tools Include:
- Tokenization
- Part-of-speech tagging
- Named entity recognition
- Parsing
- Coreference resolution
- Sentiment analysis
Access Methods:
- Online demo
- Command-line interface
- Java library
- Python wrapper (stanza)
- R integration
spaCy
spaCy is an industrial-strength NLP library in Python, fast and production-ready.
Features:
- Tokenization, POS tagging, parsing
- Named entity recognition
- Word vectors and similarity
- Pipeline components
- Model training
- Visualization (displaCy)
Languages: 60+ with pre-trained models
Why spaCy:
- Extremely fast
- Easy to use
- Well-documented
- Active development
- Large community
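spaCy itself is a Python library, but it can also be driven from R via the spacyr wrapper. A minimal sketch, assuming spaCy and an English model are already installed on the system (see the spacyr installation documentation):
```{r}
# Tokenization, POS tagging, and entity recognition via spacyr
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse("Apple is looking at buying a U.K. startup.",
                      pos = TRUE, entity = TRUE)
parsed             # one row per token: lemma, POS tag, entity annotation

spacy_finalize()   # shut down the background Python process
```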
NLTK (Natural Language Toolkit)
NLTK is a leading platform for building Python programs to work with human language data.
Strengths:
- Educational focus
- Comprehensive tutorials
- Many datasets included
- Wide range of algorithms
- Excellent documentation
- Perfect for learning NLP
Covers:
- Tokenization and stemming
- POS tagging
- Parsing
- Classification
- Clustering
- Semantic reasoning
Specialized Analysis Tools
SMARTool: Russian Language Learning

SMARTool is a corpus-based tool specifically designed for English-speaking learners of Russian.
Features:
- Handles rich Russian morphology
- 3,000 basic vocabulary items
- Frequency-based learning
- Corpus-informed examples
- User-friendly interface
- Linguist-built, theory-informed
Perfect For:
- Russian language learners
- Teachers of Russian
- Applied linguistics research on Russian
BookNLP: Processing Literary Texts
BookNLP is an NLP pipeline specifically designed for books and long documents.
Specialized for Literature:
- Character name clustering (“Tom Sawyer” = “Tom” = “Mr. Sawyer”)
- Speaker identification in quotes
- Character coreference resolution
- Referential gender inference
- Event tagging
- Supersense tagging
Also Provides:
- POS tagging
- Dependency parsing
- Entity recognition
Models Available:
- Large model (GPU/multi-core)
- Small model (personal computers)
Perfect For:
- Digital humanities
- Literary analysis
- Narrative studies
- Character analysis
CLAWS (Constituent Likelihood Automatic Word-tagging System)
CLAWS from Lancaster University is a world-leading POS tagger with very high accuracy.
Achievements:
- Tagged the British National Corpus
- 96-97% accuracy
- Multiple tagsets available
- Web demo and batch processing
- Commercial licensing available
Semantic Tagging Tools
USAS (UCREL Semantic Analysis System)
- Automatic semantic tagging
- Multi-word expression recognition
- 21 major discourse fields
- Multiple languages
Wmatrix
- Corpus comparison tool
- Combines POS, semantic tagging, keywords
- Statistical significance testing
- Web-based interface
Learning Resources and Courses
Comprehensive Courses
Applied Language Technology

Applied Language Technology from the University of Helsinki offers two excellent courses:
1. Working with Text in Python
- Python basics for text analysis
- Regular expressions
- Text processing
- No prior programming knowledge needed
2. Natural Language Processing for Linguists
- NLP fundamentals
- Machine learning basics
- Deep learning for NLP
- Practical applications
Features:
- Free and open access
- Designed for linguists and humanists
- Hands-on exercises
- Jupyter notebooks
- Self-paced learning
Cultural Analytics with Python

Introduction to Cultural Analytics & Python by Melanie Walsh is an outstanding textbook for humanities and social sciences.
Topics Covered:
- Python programming basics
- Text analysis and natural language processing
- Social media analysis
- Network analysis
- Mapping and spatial analysis
- Web scraping
- Data visualization
Why It’s Excellent:
- Written specifically for humanities scholars
- Clear explanations
- Lots of examples
- Engaging datasets
- Free and online
- Continuously updated
Perfect For:
- Digital humanities students
- Social science researchers
- Anyone wanting practical skills
Programming Historian
The Programming Historian is a peer-reviewed, open-access collection of practical lessons for digital humanists.
Lesson Categories:
- Data management
- Data manipulation
- Distant reading
- Getting ready to program
- Linked open data
- Mapping and GIS
- Network analysis
- Digital publishing
- Web scraping
- Data visualization
Languages:
- Python
- R
- JavaScript
- And more
Available In:
- English
- Spanish
- French
- Portuguese
Why It’s Trusted:
- Peer-reviewed lessons
- High quality control
- Practical focus
- Clear tutorials
- Open access
Corpus Linguistics Courses
Lancaster Summer Schools in Corpus Linguistics
- Lancaster University
- Annual intensive courses
- Beginner to advanced
- Hands-on training
- Expert instructors
Corpus Linguistics: Method, Analysis, Interpretation (FutureLearn)
- Free online course
- Lancaster University
- Six-week course
- Flexible learning
YouTube Channels:
- Laurence Anthony - AntConc tutorials
- Linguistics with a Corpus - Various corpus methods
- Data Science Dojo - Text mining and NLP
R and Statistics
Quick-R
- statmethods.net
- By Rob Kabacoff
- R code snippets
- Data management
- Statistics
- Data visualization
- Great for quick reference
STHDA (Statistical Tools for High-Throughput Data Analysis)
- sthda.com
- Comprehensive R tutorials
- Statistical methods
- Data visualization with ggplot2
- Machine learning
R for Data Science
- r4ds.had.co.nz
- By Hadley Wickham & Garrett Grolemund
- Modern data science workflow
- Tidyverse ecosystem
- Free online book
Advanced R
- adv-r.hadley.nz
- By Hadley Wickham
- Deep dive into R programming
- For experienced R users
Text Mining with R
- tidytextmining.com
- By Julia Silge & David Robinson
- Tidy approach to text analysis
- Practical examples
Quantitative Text Analysis in R
- quanteda tutorials
- Official quanteda documentation
- Comprehensive corpus analysis
- Step-by-step guides
Specialized Training
GLAM Workbench
GLAM Workbench provides tools and tutorials for working with data from galleries, libraries, archives, and museums.
Focus:
- Australia and New Zealand (expanding)
- Cultural heritage data
- Jupyter notebooks
- Interactive learning
Collections:
- National Library of Australia
- Trove
- State libraries
- Archives
- Museums
Features:
- Run live on Binder: Click and go—no installation!
- Real data from GLAM institutions
- Reproducible research
- Community contributions welcome
TAPoR 3: Text Analysis Portal
TAPoR 3 is a curated directory of research tools for text analysis.
What TAPoR Offers:
- 1,500+ tools listed
- Categories and tags
- Tool descriptions and reviews
- Comparison features
- Community curation
Tool Categories:
- Analyze
- Interpret
- Visualize
- Manipulate
- Create
- Explore
Search by:
- Analysis type
- Platform
- Programming language
- Cost
- Discipline
Perfect For:
- Discovering new tools
- Comparing options
- Finding specialized tools
- Staying updated
Research Centers and Labs
Leading Corpus Linguistics Centers
Lancaster University Corpus Linguistics
UCREL (University Centre for Computer Corpus Research on Language)
- ucrel.lancs.ac.uk
- World-leading corpus linguistics research
- Developers of CLAWS, USAS, Wmatrix
- British National Corpus
- Training and resources
CASS (Corpus Approaches to Social Science)
- Applying corpus methods to social science
- Discourse analysis
- Critical discourse analysis
VARIENG: Helsinki Corpus Research

VARIENG - Research Unit for Variation, Contacts and Change in English
Focus Areas:
- Language variation
- Language contact
- Language change
- Corpus compilation
- Sociolinguistics
Major Corpora:
- Helsinki Corpus of English Texts
- Corpus of Early English Correspondence
- Parsed corpora
Sydney Corpus Lab

Sydney Corpus Lab promotes corpus linguistics in Australia.
Mission:
- Build research capacity
- Connect Australian corpus linguists
- Promote corpus methods across disciplines
- Virtual lab fostering collaboration
Activities:
- Workshops and training
- Research collaboration
- Resource sharing
- Community building
Text Crunching Centre (TCC)

Text Crunching Centre at University of Zurich provides NLP expertise as a service.
Services:
- NLP consulting
- Custom tool development
- Text processing pipelines
- Named entity recognition
- Sentiment analysis
- Topic modeling
For:
- UZH researchers
- External partners
- Commercial clients
Language Acquisition and Variation
AcqVA Aurora Lab

AcqVA Aurora Lab at UiT The Arctic University of Norway
Research Areas:
- Language acquisition
- Language variation
- Language attrition
- Psycholinguistics
Support:
- Methodological consultation
- Data collection facilities
- Analysis support
Digital Humanities Centers
Stanford Literary Lab
- Computational literary studies
- Quantitative methods
- Research publications
- Innovative approaches
Nebraska Literary Lab
- Digital humanities research
- Text analysis projects
- Teaching resources
King’s Digital Lab
- Software development
- Digital humanities projects
- Research infrastructure
- Training and support
Media and Communication Research
Media Research Methods Lab

Media Research Methods Lab at Leibniz Institute for Media Research
Focus:
- Computational social science
- Social media analysis
- Automated content analysis
- Network analysis
Methods:
- Survey research
- Experiments
- Content analysis
- Log data analysis
- Experience sampling
National Centre for Text Mining (NaCTeM)
NaCTeM - the first publicly funded text mining centre in the world, based in the UK
Services:
- Text mining tools
- Literature-based discovery
- Biological database curation
- Clinical informatics
Resources:
- Software tools
- Training materials
- Publications
Corpora and Datasets
Major English Corpora
British National Corpus (BNC)
- 100 million words
- 1980s-1990s British English
- Written and spoken
- POS tagged
- Gold standard reference corpus
BNC2014
- 100 million words
- 2010s British English
- Conversation focus
- Comparable to original BNC
Corpus of Contemporary American English (COCA)
- 1 billion words (1990-2019)
- Balanced: spoken, fiction, popular magazines, newspapers, academic
- Updated regularly
- Genre and time filtering
- Free access
International Corpus of English (ICE)
- Comparable corpora for 20+ national and regional varieties
- 1 million words per variety
- Spoken and written
- Comparable structure
- Grammatically annotated
ICE Components Include:
- ICE-GB (Britain)
- ICE-USA (United States)
- ICE-Ireland
- ICE-Canada
- ICE-India
- ICE-Hong Kong
- And many more
Google Books Ngram Corpus
Google Books Ngram Viewer
- Trillions of words
- 1500-2019
- Multiple languages
- Phrase frequency over time
- Great for diachronic studies
Dataset Available:
- Download for offline analysis
- Excellent for language change research
Historical Corpora
Corpus of Historical American English (COHA)
- 400 million words
- 1820s-2000s
- Balanced by decade
- Fiction, magazines, newspapers, non-fiction
- Track language change
Early English Books Online (EEBO)
- 1473-1700
- 25,000+ texts
- English printing beginnings
- Critical for historical linguistics
Corpus of English Dialogues (CED)
Compiled at Uppsala University and Lancaster University
- 1560-1760
- Trial proceedings
- Drama
- Didactic works
- Real/simulated conversation
Specialized Corpora
MICASE (Michigan Corpus of Academic Spoken English)
- 1.8 million words
- Academic speech
- Lectures, seminars, study groups
- Searchable online
VOICE (Vienna-Oxford International Corpus of English)
- 1 million words
- English as a Lingua Franca
- Speakers from 50+ first languages
- Face-to-face interaction
CHILDES (Child Language Data Exchange System)
- Child language acquisition data
- 50+ languages
- Longitudinal corpora
- Transcription standards
- Analysis tools (CLAN)
Learner Corpora
ICLE (International Corpus of Learner English)
- University-level learners
- Multiple L1 backgrounds
- Academic essays
EF-Cambridge Open Language Database
- EFCAMDAT
- Large-scale learner corpus
- 70 million words
- 150+ countries
Multilingual Resources
Universal Dependencies
- 100+ languages
- Syntactically annotated treebanks
- Cross-linguistically consistent
- Open source
- Regular releases
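One convenient way to explore UD-style annotations from R is the udpipe package (a separate project built on UD treebanks); a minimal sketch, assuming the pre-trained English model can be downloaded on first use:
```{r}
# Dependency-parse a sentence with udpipe; the model is fetched
# automatically the first time this runs
library(udpipe)

anno <- udpipe("The cat sat on the mat.", object = "english")

# Universal POS tags and dependency relations, one row per token
anno[, c("token", "lemma", "upos", "dep_rel")]
```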
Leipzig Corpora Collection
corpora.uni-leipzig.de
- 136 languages
- Web-crawled texts
- Various sizes
- Freely downloadable
Parallel Corpora
OPUS (Open Parallel Corpus)
- opus.nlpl.eu
- Translation memory corpora
- 90+ languages
- Movie subtitles (OpenSubtitles)
- Bible translations
- EU documents
- Free download
Europarl
- European Parliament proceedings
- 21 languages
- Sentence-aligned
- Large-scale
Blogs and Communities
Corpus Linguistics Blogs
Around the Word
Guillaume Desagulier’s corpus linguistics notebook, hosted on Hypotheses
Topics:
- Usage-based corpus linguistics
- R for corpus analysis
- Cognitive linguistics
- Construction grammar
- Practical methods
Features:
- Code examples
- Methodological reflections
- Experimental approaches
Linguistics with a Corpus
Companion blog to Doing Linguistics with a Corpus by Egbert, Larsson & Biber
Topics:
- Corpus methodology
- Research design
- Statistical issues
- Best practices
Community:
- Open discussion
- Reader questions
- Methodological debates
Corpora List
Corpora Mailing List
- Long-running discussion list
- Corpus linguistics community
- Job postings
- Conference announcements
- Tool discussions
- Corpus releases
Data Science and NLP Blogs
Aneesha Bakharia
Topics:
- Data science
- Topic modeling
- Deep learning
- Learning analytics
- Practical tutorials
Digital Observatory Blog
Digital Observatory at Queensland University of Technology
Content:
- Social media data collection
- Workshops and training
- Tool updates
- Open office hours
- Research methods
Probably Overthinking It
Allen Downey’s Blog
- Bayesian statistics
- Data science
- Clear explanations
- Python examples
Simply Statistics
simplystatistics.org
- Statistics and data science
- By academic leaders (Jeff Leek, Roger Peng, Rafa Irizarry)
- Thoughtful commentary
Digital Humanities
Digital Humanities Now
digitalhumanitiesnow.org
- News aggregator
- DH community
- Project highlights
- Curated content
DHd Blog
DHd-Blog
- German DH community
- Multilingual posts
- Method discussions
- Project showcases
Awards and Showcases
Digital Humanities Awards
Digital Humanities Awards - Annual recognition of excellent DH work
Categories:
- Best Exploration of DH Failure/Limitations
- Best DH Data Visualization
- Best Use of DH for Fun
- Best DH Dataset
- Best DH Short Publication
- Best DH Tool or Suite of Tools
- Best DH Training Materials
- Special categories (vary yearly)
Why Follow:
- Discover cutting-edge projects
- Find high-quality resources
- See what the community values
- Get inspired
Platforms for Showcasing Work
Humanities Commons
hcommons.org
- Academic social network
- Share research
- Discover projects
- Collaborate
DHCommons
Project registry and showcase
- Find collaborators
- Share projects
- Get feedback
Additional Resources
Visualization Tools
R Visualization
ggplot2
- ggplot2.tidyverse.org
- Grammar of Graphics
- Publication-quality plots
- Highly customizable
plotly
- plotly.com/r
- Interactive graphs
- Web-ready
- R and Python
Shiny
- shiny.rstudio.com
- Interactive web apps
- R-based
- No web dev needed
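As a flavour of the R plotting workflow, here is a minimal ggplot2 sketch; the word frequencies are invented for illustration:
```{r}
# A simple publication-style frequency bar chart with ggplot2
library(ggplot2)

freqs <- data.frame(word = c("language", "data", "corpus", "text"),
                    n    = c(42, 37, 29, 18))

ggplot(freqs, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +   # horizontal bars are easier to read for word labels
  labs(x = NULL, y = "Frequency", title = "Top words (illustrative data)")
```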
Python Visualization
matplotlib
- Foundation library
- Highly customizable
- Publication quality
seaborn
- Statistical visualization
- Beautiful defaults
- Built on matplotlib
altair
- Declarative visualization
- Interactive
- Grammar of graphics
Network Visualization
Gephi
- gephi.org
- Interactive network visualization
- Large graph support
- Open source
Cytoscape
- cytoscape.org
- Network analysis
- Biological focus
- Plugin ecosystem
Annotation Tools
WebAnno
webanno.github.io
- Web-based annotation
- Multiple users
- Various annotation types
- Inter-annotator agreement
brat
brat.nlplab.org
- Browser-based annotation tool
- Linguistic annotation
- Entity relationships
- Coreference chains
Prodigy
prodi.gy
- Modern annotation tool
- Active learning
- Quick setup
- Commercial (with educational discount)
INCEpTION
inception-project.github.io
- Semantic annotation
- Knowledge base integration
- Annotation recommenders (active learning)
- Open source
Data Management
Version Control
Git and GitHub
- github.com
- Code versioning
- Collaboration
- Open science
GitLab
- gitlab.com
- Alternative to GitHub
- Private repositories
- CI/CD integration
Research Data Management
OSF (Open Science Framework)
- osf.io
- Project management
- Preregistration
- DOI minting
- Version control
Zenodo
- zenodo.org
- Long-term data preservation
- DOI for datasets
- GitHub integration
- Free storage
Dataverse
- dataverse.org
- Data repository network
- Multiple institutions
- Data citation
Documentation
Read the Docs
- readthedocs.org
- Documentation hosting
- Sphinx integration
- Versioning
Jupyter Book
- jupyterbook.org
- Book from notebooks
- Beautiful online books
- Open source
Text Analysis Frameworks
Python
scikit-learn
- Machine learning
- Text classification
- Clustering
- Feature extraction
Gensim
- Topic modeling
- Word embeddings
- Document similarity
- Large corpus efficiency
TextBlob
- Simple NLP
- Sentiment analysis
- Translation
- Great for beginners
R Frameworks
quanteda
- Comprehensive text analysis
- Fast and efficient
- Well-documented
- Our recommended choice
tidytext
- Tidy data principles
- Integrates with dplyr/ggplot2
- Text mining
- Sentiment analysis
tm (text mining)
- Established package
- Document-term matrices
- Preprocessing
- Compatible with many tools
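To illustrate the tidy approach named above, here is a minimal tidytext sketch; the two example sentences are invented:
```{r}
# Tidy text analysis: one token per row, then ordinary dplyr verbs
library(dplyr)
library(tidytext)

docs <- tibble(id   = 1:2,
               text = c("A truly happy result from a clean corpus.",
                        "A sad, messy dataset ruins the analysis."))

docs |>
  unnest_tokens(word, text) |>          # tokenize to one word per row
  inner_join(get_sentiments("bing"),    # bundled positive/negative lexicon
             by = "word") |>
  count(id, sentiment)                  # sentiment counts per document
```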
Machine Learning Platforms
Hugging Face
huggingface.co
- Transformers library
- Pre-trained models
- Datasets
- Model Hub
- Community-driven
Popular Models:
- BERT, GPT, RoBERTa
- T5, BART, XLM
- Language-specific models
TensorFlow and Keras
tensorflow.org
- Deep learning framework
- Keras high-level API
- Production deployment
- Large community
PyTorch
pytorch.org
- Deep learning framework
- Research-friendly
- Dynamic computation graphs
- Growing adoption
Getting Started Guide
For Complete Beginners
Week 1-2: Explore without installing
1. Try Voyant Tools
2. Search COCA
3. Watch AntConc tutorials
Week 3-4: Install first tools
1. Download AntConc
2. Follow a simple tutorial
3. Analyze your own texts
Month 2: Learn programming basics
1. Start Introduction to Cultural Analytics
2. OR Applied Language Technology
3. Practice with small projects
Month 3+: Specialize
1. Choose your focus (corpus linguistics, NLP, digital humanities)
2. Join relevant communities
3. Start a research project
4. Share your work
For Researchers
Setting Up Your Infrastructure:
1. Version control: Create GitHub account
2. Coding environment: Install R/Python/both
3. Editor: RStudio and/or Jupyter
4. Organization: Adopt project structure
5. Documentation: Start research notebook
Essential Skills to Develop:
- Regular expressions (a small sketch follows this list)
- Data cleaning and preprocessing
- Statistical basics
- Visualization
- Reproducible research practices
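As a small illustration of the first two skills, a base-R sketch; all example strings are invented:
```{r}
# Regular expressions and basic text cleaning in base R
texts <- c("  The YEAR was 1984. ", "Call me at 555-0199!!", "<p>Some HTML</p>")

cleaned <- tolower(texts)                 # normalize case
cleaned <- gsub("<[^>]+>", "", cleaned)   # strip HTML tags
cleaned <- trimws(cleaned)                # trim leading/trailing whitespace
cleaned

# extract every digit sequence with a regular expression
regmatches(texts, gregexpr("[0-9]+", texts))
```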
Building Your Toolkit:
- Start with GUI tools (AntConc)
- Graduate to programming (R/Python)
- Master one language well
- Learn complementary tools as needed
For Teachers
Resources for Teaching:
Beginner-Friendly:
- Voyant Tools (no installation)
- AntConc (easy to use)
- Online concordancers
- GLAM Workbench (interactive notebooks)
Intermediate:
- Programming Historian lessons
- Cultural Analytics textbook
- Applied Language Technology course
- LADAL tutorials
Advanced:
- Research projects
- GitHub for collaboration
- Publication-ready analysis
Teaching Tips:
- Start with questions, not tools
- Use data students care about
- Scaffold from GUI to code
- Emphasize reproducibility
- Encourage sharing
Contributing to LADAL
Want to contribute a tutorial to LADAL? We welcome new authors!
Style Guide
Getting Started
Download Templates:
- Tutorial template
- Bibliography file
Required Elements:
1. Clear learning objectives
2. Installation/setup instructions
3. Worked examples with code
4. Exercises with solutions
5. Session info for reproducibility
Formatting Standards
Headers:
- Level 1: Numbered automatically
- Lower levels: Add {-} to suppress numbering
Code Style:
- Function and package names in backticks: `function()`
- Consistent spacing and indentation
- Comments explaining complex operations
- Use library() not require()
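The last point in brief:
```{r}
# Preferred: library() signals a missing package with an immediate error
library(quanteda)

# Avoid: require() merely returns FALSE and lets the script continue
# require(quanteda)
```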
Emphasis:
- Use italics for emphasis
- Bold sparingly
- Code font for technical terms
Tables:
- Use flextable for display
- Clear captions
- Appropriate width settings
Exercises:
Use the standard exercise format:
***
<div class="warning">
<p style='margin-top:1em; text-align:center'>
<b>EXERCISE TIME!</b>
</p>
</div>
<div class="question">
1. Exercise question here
<details>
<summary>Answer</summary>
```{r}
# Solution code
```
</details>
</div>
***
Content Guidelines
Clarity:
- Explain concepts before technical details
- Use concrete examples
- Define jargon on first use
- Anticipate confusion
Pedagogy:
- Build progressively
- Provide context and motivation
- Multiple examples at different levels
- Practice opportunities throughout
Reproducibility:
- Complete code provided
- Package versions documented
- Data sources cited
- sessionInfo() at end
Accessibility:
- Assume minimal prior knowledge
- Clear instructions
- Multiple learning paths
- Supportive tone
Citation Format
At the end of your tutorial:
Your last name, your first name. 2025. *The title of your tutorial*.
Your location: your affiliation (if you have one).
url: https://slcladal.github.io/shorttitleofyourtutorial.html
(Version 2025.MM.DD).
BibTeX entry:
@manual{yourlastname2025topic,
  author = {YourLastName, YourFirstName},
  title = {The Title of Your Tutorial},
  note = {https://slcladal.github.io/shorttitleofyourtutorial.html},
  year = {2025},
  organization = {Your Affiliation},
  address = {Your Location},
  edition = {2025.MM.DD}
}
Session Info:
Include the output of sessionInfo() at the end of your tutorial.
Submission Process
Steps:
1. Develop your tutorial following the template
2. Test all code thoroughly
3. Proofread carefully
4. Email to LADAL team with:
- Your Rmd file
- Any required data files
- Brief description
- Your bio (optional)
We’ll:
1. Review for quality and fit
2. Suggest any revisions
3. Integrate into LADAL site
4. Credit you as author
5. Promote your tutorial
Questions?
Contact us through the LADAL website or email.
Stay Updated
Mailing Lists
Corpora List: Subscribe
- Corpus linguistics community
- Tool announcements
- Conference info
- Job postings
Humanist Discussion Group: Subscribe
- Digital humanities
- Long-running (since 1987!)
- Thoughtful discussions
Conferences
Corpus Linguistics:
- ICAME (International Computer Archive of Modern and Medieval English)
- Corpus Linguistics (biennial)
- CLUK (Corpus Linguistics in the UK)
Digital Humanities:
- DH (Digital Humanities conference)
- TEI (Text Encoding Initiative)
- ADHO (Alliance of Digital Humanities Organizations)
NLP/Computational Linguistics:
- ACL (Association for Computational Linguistics)
- NAACL (North American Chapter of ACL)
- EMNLP (Empirical Methods in NLP)
- COLING (International Conference on Computational Linguistics)
Check:
- linguistic-conferences.org
- dh-abstracts.library.cmu.edu
Conclusion
This resource page brings together the best tools, tutorials, corpora, and communities for language technology and text analysis. Whether you’re just starting out or are an experienced researcher, these resources will support your work.
- Bookmark this page for easy reference
- Pick one resource that matches your current needs
- Join one community to connect with others
- Start a project applying what you learn
- Share your work and help others
- Contribute back by creating tutorials or tools
Remember:
- Start small and build gradually
- Focus on questions, not just tools
- Learn from the community
- Share your knowledge
- Reproducibility matters
- Have fun with your data!
The LADAL team is here to support you. Explore our tutorials, use our tools, and don’t hesitate to reach out with questions!
References
Anthony, L. (2004). AntConc: A learner and classroom friendly, multi-platform corpus analysis toolkit. *IWLeL 2004: An Interactive Workshop on Language e-Learning*, 7-13.
Baker, P. (2006). *Using corpora in discourse analysis*. Continuum.
Egbert, J., Larsson, T., & Biber, D. (2020). *Doing linguistics with a corpus: Methodological considerations for the everyday user*. Cambridge University Press.
McEnery, T., & Hardie, A. (2011). *Corpus linguistics: Method, theory and practice*. Cambridge University Press.
Silge, J., & Robinson, D. (2017). *Text mining with R: A tidy approach*. O’Reilly Media.
Sinclair, J. (1991). *Corpus, concordance, collocation*. Oxford University Press.
Walsh, M. (2021). *Introduction to Cultural Analytics & Python*. https://melaniewalsh.github.io/Intro-Cultural-Analytics/
Wickham, H., & Grolemund, G. (2016). *R for data science: Import, tidy, transform, visualize, and model data*. O’Reilly Media.
Social Media and Web Corpora
Twitter Datasets
Availability:
- Twitter API (now X API)
- Academic research access
- Historical archives
Tools:
- Twarc: Twitter archiving
- Tweepy: Python library
- rtweet: R package
Reddit Datasets
Pushshift Reddit Dataset
- Comments and submissions
- 2005-present
- Searchable
- JSON format
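Dumps like these typically arrive as newline-delimited JSON; a minimal sketch of reading such a file in R (the file name and column are illustrative):
```{r}
# Stream a newline-delimited JSON file into a data frame
library(jsonlite)

comments <- stream_in(file("reddit_comments_sample.ndjson"))
head(comments$body)   # Reddit comment text lives in the "body" field
```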
Common Crawl
Common Crawl
- Petabytes of web data
- Monthly snapshots since 2008
- Free access
- Requires processing
Last Updated: 2026-02-08
Maintained by: The LADAL Team
Contribute: Know a great resource we missed? Let us know!