RESOURCES

Essential Resources for Language Technology and Data Science

This curated collection brings together the best resources for language technology, text analytics, corpus linguistics, natural language processing, and computational methods in the humanities and social sciences. Whether you’re a complete beginner or an experienced researcher, you’ll find tools, tutorials, datasets, and communities to support your work.

About LADAL

The Language Technology and Data Analysis Laboratory (LADAL) is part of LDaCA (Language Data Commons of Australia), providing tutorials, tools, and training materials for language data science and text analytics. This resources page serves as a comprehensive guide to the broader ecosystem of language technology resources available to researchers worldwide.

How to Use This Page
  • Browse by category using the table of contents
  • Start with Tools if you need software immediately
  • Explore Courses for structured learning
  • Join Communities to connect with others
  • Check Datasets for practice materials
  • Bookmark this page as your go-to resource hub!

Platforms and Infrastructure

LDaCA: Language Data Commons of Australia

LADAL is part of the Language Data Commons of Australia (LDaCA), a national research infrastructure providing access to language data and text analysis tools for Australian researchers.

What LDaCA Offers:
- Data discovery: Search and access diverse language datasets
- Jupyter Notebooks: Interactive coding environment, no installation needed
- Analysis tools: Ready-to-use text analysis capabilities
- Data collections: Curated corpora from Australian sources
- Training and support: Workshops, tutorials, documentation
- Community: Connect with language data researchers nationally
- Standards: FAIR data principles and metadata standards

LDaCA’s Evolution:
LDaCA emerged from the merger of the Australian Text Analytics Platform (ATAP) and PARADISEC, bringing together text analytics infrastructure and endangered language archives to create a comprehensive language data ecosystem.

Key Features:
- Accessible to researchers without strong coding backgrounds
- Combines language corpora with analysis tools
- Includes Australian English, Indigenous languages, and migrant languages
- Supports diverse research communities
- Free for Australian researchers

LDaCA Data Collections:
- Australian language corpora
- Social media datasets
- Historical texts
- Indigenous language materials
- Oral history collections
- Migrant language resources

Analysis Capabilities:
- Text preprocessing pipelines
- Concordancing and corpus analysis
- Topic modeling
- Sentiment analysis
- Named entity recognition
- Network analysis for texts

Getting Started with LDaCA

Visit ldaca.edu.au to:
1. Explore available data collections
2. Create a free account
3. Access Jupyter notebooks and analysis tools
4. Search for language datasets
5. Book training sessions
6. Join the LDaCA community


Software Tools

Concordancing and Corpus Analysis

AntConc and AntLab Suite

AntConc is the most widely used free concordancing tool worldwide, developed by Laurence Anthony.

Core Features:
- Concordance: KWIC displays with powerful search
- Collocates: Find words that appear together
- Word clusters: N-gram analysis
- Keyword analysis: Compare corpora
- Dispersion plots: Visualize word distribution

Why AntConc:
- ✅ Completely free
- ✅ Cross-platform (Windows, Mac, Linux)
- ✅ No installation hassles
- ✅ Perfect for teaching and learning
- ✅ Handles large corpora efficiently
- ✅ Extensive documentation and tutorials

Other Tools in AntLab:

Laurence Anthony’s AntLab offers an impressive suite of specialized tools:

  • AntFileConverter: Convert PDF and Word files to plain text
  • AntFileSplitter: Divide large files into manageable chunks
  • AntGram: N-gram and word frequency analysis
  • AntPConc: Parallel concordancer for translation studies
  • AntWordProfiler: Vocabulary profiling and analysis
  • AntMover: Genre and move analysis
  • EncodeAnt: Character encoding conversion
  • VariAnt: Spelling variant analysis
  • FireAnt: Social media data collection and analysis

AntConc for Beginners

Getting Started:
1. Download from laurenceanthony.net
2. Load plain text files or folders
3. Click “Concordance” tab
4. Enter search term
5. Explore your results!

Tutorial: See our Concordancing Tutorial for R-based concordancing.
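
If you prefer to stay in R, a basic concordance takes only a few lines with the quanteda package. A minimal sketch (one option among several; the package must be installed first):

```r
# Keyword-in-context concordancing with quanteda
library(quanteda)

texts <- c(doc1 = "We compiled the corpus from radio interviews.",
           doc2 = "A corpus is a principled collection of texts.")

toks <- tokens(texts)                        # tokenise the documents
kwic(toks, pattern = "corpus", window = 3)   # 3 words of context each side
```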

#LancsBox

#LancsBox is a next-generation corpus analysis toolkit from Lancaster University combining ease-of-use with advanced functionality.

Key Features:
- Modern, intuitive interface
- GraphColl: Visualize collocational networks
- Whelk: Frequency distribution of results across corpus files
- Wizard: Guided analysis for beginners
- KWIC concordances with rich context
- Built-in statistical tests
- Multi-language support

Advantages:
- Free for academic use
- Gentler learning curve than AntConc for some users
- Beautiful visualizations
- Integrated help system

Sketch Engine

Sketch Engine is a comprehensive commercial platform with web access and extensive pre-loaded corpora.

Unique Features:
- Word Sketches: Grammatical/collocational summaries at a glance
- 90+ languages: Pre-loaded corpora
- Corpus building: Upload and process your texts
- Terminology extraction: Automatic keyword identification
- Parallel corpora: Translation studies support
- Collaborative: Team projects and sharing

Best For:
- Multilingual research
- Professional corpus linguistics
- Large-scale projects
- Teams needing shared resources

Pricing: Free trial, then subscription (academic discounts available)

WordSmith Tools

WordSmith Tools by Mike Scott is the professional standard for corpus analysis, trusted by thousands of researchers.

Core Components:
- Concord: Concordancing with sophisticated search
- KeyWords: Statistical keyword extraction
- WordList: Frequency lists and statistics
- Dispersion plots and distribution analysis
- Detailed collocate analysis
- Batch processing for multiple files

Why WordSmith:
- Industry standard since 1996
- Extremely powerful
- Publication-quality statistics
- Comprehensive documentation
- Active user community

Note: Windows only (runs on Mac via Wine), reasonably priced license

Online Concordancers

No installation needed:

English-Corpora.org (formerly BYU Corpora) - english-corpora.org
- COCA (Corpus of Contemporary American English): 1 billion words
- COHA (Corpus of Historical American English): 400 million words, 1820s-2000s
- NOW Corpus: Web text, constantly updated
- TV Corpus, Movie Corpus, Wikipedia Corpus
- Free with registration, powerful search interface

Lextutor - lextutor.ca
- Free web-based concordancers
- Vocabulary profilers
- Excellent for language learning
- Multiple corpora available

Sketch Engine Web - sketchengine.eu
- Access without software installation
- Pre-loaded mega-corpora
- Free trial available

Text Analysis and NLP Tools

Voyant Tools

Voyant Tools is a web-based reading and analysis environment for digital texts.

Features:
- Zero installation—works in browser
- Upload texts instantly
- Multiple visualization types
- Word clouds, trend graphs, networks
- Collaborative analysis
- Embed visualizations in websites

Perfect For:
- Digital humanities
- Quick exploratory analysis
- Teaching text analysis
- Public-facing projects
- Students and beginners

Getting Started:
Visit voyant-tools.org, paste text or upload files, and start exploring!

Orange Data Mining

Orange is a visual programming tool for data mining, machine learning, and text analytics.

Text Add-on Features:
- Preprocessing pipelines
- Topic modeling (LDA)
- Sentiment analysis
- Document clustering
- Word clouds and visualizations
- Machine learning classification

Why Orange:
- Visual workflow design (drag and drop)
- No coding required
- Powerful machine learning
- Great for teaching
- Free and open source

GATE (General Architecture for Text Engineering)

GATE is a powerful open-source platform for text processing, especially NLP.

Capabilities:
- Information extraction
- Named entity recognition
- Relation extraction
- Opinion mining
- Semantic annotation
- Language identification

Best For:
- Large-scale text processing
- Custom NLP pipelines
- Research requiring annotation
- Multilingual projects

Stanford CoreNLP

Stanford CoreNLP provides a suite of NLP tools with state-of-the-art accuracy.

Tools Include:
- Tokenization
- Part-of-speech tagging
- Named entity recognition
- Parsing
- Coreference resolution
- Sentiment analysis

Access Methods:
- Online demo
- Command-line interface
- Java library
- Python interface (Stanza)
- R integration

spaCy

spaCy is an industrial-strength NLP library in Python, fast and production-ready.

Features:
- Tokenization, POS tagging, parsing
- Named entity recognition
- Word vectors and similarity
- Pipeline components
- Model training
- Visualization (displaCy)

Languages: 60+ with pre-trained models

Why spaCy:
- Extremely fast
- Easy to use
- Well-documented
- Active development
- Large community
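
If you work in R, the spacyr package wraps spaCy so its models can be used inside R workflows. A minimal sketch, assuming spaCy and the small English model are already installed (spacyr::spacy_install() can set up both):

```r
# spaCy from R via the spacyr wrapper
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse("Apple is opening an office in Brisbane.",
                      pos = TRUE, entity = TRUE)
parsed                   # one row per token: lemma, POS tag, entity type
entity_extract(parsed)   # just the named entities

spacy_finalize()         # shut down the background spaCy process
```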

NLTK (Natural Language Toolkit)

NLTK is a leading platform for building Python programs to work with human language data.

Strengths:
- Educational focus
- Comprehensive tutorials
- Many datasets included
- Wide range of algorithms
- Excellent documentation
- Perfect for learning NLP

Covers:
- Tokenization and stemming
- POS tagging
- Parsing
- Classification
- Clustering
- Semantic reasoning

Specialized Analysis Tools

SMARTool: Russian Language Learning

SMARTool is a corpus-based tool specifically designed for English-speaking learners of Russian.

Features:
- Handles rich Russian morphology
- 3,000 basic vocabulary items
- Frequency-based learning
- Corpus-informed examples
- User-friendly interface
- Linguist-built, theory-informed

Perfect For:
- Russian language learners
- Teachers of Russian
- Applied linguistics research on Russian

BookNLP: Processing Literary Texts

BookNLP is an NLP pipeline specifically designed for books and long documents.

Specialized for Literature:
- Character name clustering (“Tom Sawyer” = “Tom” = “Mr. Sawyer”)
- Speaker identification in quotes
- Character coreference resolution
- Referential gender inference
- Event tagging
- Supersense tagging

Also Provides:
- POS tagging
- Dependency parsing
- Entity recognition

Models Available:
- Large model (GPU/multi-core)
- Small model (personal computers)

Perfect For:
- Digital humanities
- Literary analysis
- Narrative studies
- Character analysis

CLAWS (Constituent Likelihood Automatic Word-tagging System)

CLAWS from Lancaster University is a world-leading POS tagger with very high accuracy.

Achievements:
- Tagged the British National Corpus
- 96-97% accuracy
- Multiple tagsets available
- Web demo and batch processing
- Commercial licensing available

Semantic Tagging Tools

USAS (UCREL Semantic Analysis System)
- Automatic semantic tagging
- Multi-word expression recognition
- 21 major discourse fields
- Multiple languages

Wmatrix
- Corpus comparison tool
- Combines POS, semantic tagging, keywords
- Statistical significance testing
- Web-based interface


Learning Resources and Courses

Comprehensive Courses

Applied Language Technology

Applied Language Technology from the University of Helsinki offers two excellent courses:

1. Working with Text in Python
- Python basics for text analysis
- Regular expressions
- Text processing
- No prior programming knowledge needed

2. Natural Language Processing for Linguists
- NLP fundamentals
- Machine learning basics
- Deep learning for NLP
- Practical applications

Features:
- Free and open access
- Designed for linguists and humanists
- Hands-on exercises
- Jupyter notebooks
- Self-paced learning

Cultural Analytics with Python

Introduction to Cultural Analytics & Python by Melanie Walsh is an outstanding textbook for humanities and social sciences.

Topics Covered:
- Python programming basics
- Text analysis and natural language processing
- Social media analysis
- Network analysis
- Mapping and spatial analysis
- Web scraping
- Data visualization

Why It’s Excellent:
- Written specifically for humanities scholars
- Clear explanations
- Lots of examples
- Engaging datasets
- Free and online
- Continuously updated

Perfect For:
- Digital humanities students
- Social science researchers
- Anyone wanting practical skills

Programming Historian

The Programming Historian is a peer-reviewed, open-access collection of practical lessons for digital humanists.

Lesson Categories:
- Data management
- Data manipulation
- Distant reading
- Getting ready to program
- Linked open data
- Mapping and GIS
- Network analysis
- Digital publishing
- Web scraping
- Data visualization

Languages:
- Python
- R
- JavaScript
- And more

Available In:
- English
- Spanish
- French
- Portuguese

Why It’s Trusted:
- Peer-reviewed lessons
- High quality control
- Practical focus
- Clear tutorials
- Open access

Corpus Linguistics Courses

Lancaster Summer Schools in Corpus Linguistics
- Lancaster University
- Annual intensive courses
- Beginner to advanced
- Hands-on training
- Expert instructors

Corpus Linguistics: Method, Analysis, Interpretation (FutureLearn)
- Free online course
- Lancaster University
- Six-week course
- Flexible learning

YouTube Channels:

  • Laurence Anthony - AntConc tutorials
  • Linguistics with a Corpus - Various corpus methods
  • Data Science Dojo - Text mining and NLP

R and Statistics

Quick-R
- statmethods.net
- By Rob Kabacoff
- R code snippets
- Data management
- Statistics
- Data visualization
- Great for quick reference

STHDA (Statistical Tools for High-Throughput Data Analysis)
- sthda.com
- Comprehensive R tutorials
- Statistical methods
- Data visualization with ggplot2
- Machine learning

R for Data Science
- r4ds.had.co.nz
- By Hadley Wickham & Garrett Grolemund
- Modern data science workflow
- Tidyverse ecosystem
- Free online book

Advanced R
- adv-r.hadley.nz
- By Hadley Wickham
- Deep dive into R programming
- For experienced R users

Text Mining with R
- tidytextmining.com
- By Julia Silge & David Robinson
- Tidy approach to text analysis
- Practical examples
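
A taste of the tidy approach, as a minimal sketch on two toy sentences:

```r
# Tidy text mining: tokenise, drop stop words, count
library(dplyr)
library(tidytext)

df <- tibble(line = 1:2,
             text = c("Text mining with R is fun.",
                      "Tidy tools make text mining approachable."))

df %>%
  unnest_tokens(word, text) %>%           # one word per row
  anti_join(stop_words, by = "word") %>%  # remove function words
  count(word, sort = TRUE)                # frequency table
```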

Quantitative Text Analysis in R
- quanteda tutorials
- Official quanteda documentation
- Comprehensive corpus analysis
- Step-by-step guides
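
For comparison, the same kind of frequency work in quanteda, again sketched on toy sentences:

```r
# From raw text to a document-feature matrix with quanteda
library(quanteda)

corp  <- corpus(c("The cat sat on the mat.",
                  "The dog chased the cat."))
dfmat <- corp |>
  tokens(remove_punct = TRUE) |>
  dfm() |>
  dfm_remove(stopwords("en"))

topfeatures(dfmat)   # most frequent features across the corpus
```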

Specialized Training

GLAM Workbench

GLAM Workbench provides tools and tutorials for working with data from galleries, libraries, archives, and museums.

Focus:
- Australia and New Zealand (expanding)
- Cultural heritage data
- Jupyter notebooks
- Interactive learning

Collections:
- National Library of Australia
- Trove
- State libraries
- Archives
- Museums

Features:
- Run live on Binder: Click and go—no installation!
- Real data from GLAM institutions
- Reproducible research
- Community contributions welcome

TAPoR 3: Text Analysis Portal

TAPoR 3 is a curated directory of research tools for text analysis.

What TAPoR Offers:
- 1,500+ tools listed
- Categories and tags
- Tool descriptions and reviews
- Comparison features
- Community curation

Tool Categories:
- Analyze
- Interpret
- Visualize
- Manipulate
- Create
- Explore

Search by:
- Analysis type
- Platform
- Programming language
- Cost
- Discipline

Perfect For:
- Discovering new tools
- Comparing options
- Finding specialized tools
- Staying updated


Research Centers and Labs

Leading Corpus Linguistics Centers

Lancaster University Corpus Linguistics

UCREL (University Centre for Computer Corpus Research on Language)
- ucrel.lancs.ac.uk
- World-leading corpus linguistics research
- Developers of CLAWS, USAS, Wmatrix
- British National Corpus
- Training and resources

CASS (Corpus Approaches to Social Science)
- Applying corpus methods to social science
- Discourse analysis
- Critical discourse analysis

VARIENG: Helsinki Corpus Research

VARIENG - Research Unit for Variation, Contacts and Change in English

Focus Areas:
- Language variation
- Language contact
- Language change
- Corpus compilation
- Sociolinguistics

Major Corpora:
- Helsinki Corpus of English Texts
- Corpus of Early English Correspondence
- Parsed corpora

Sydney Corpus Lab

Sydney Corpus Lab promotes corpus linguistics in Australia.

Mission:
- Build research capacity
- Connect Australian corpus linguists
- Promote method across disciplines
- Virtual lab fostering collaboration

Activities:
- Workshops and training
- Research collaboration
- Resource sharing
- Community building

Text Crunching Centre (TCC)

Text Crunching Centre at the University of Zurich provides NLP expertise as a service.

Services:
- NLP consulting
- Custom tool development
- Text processing pipelines
- Named entity recognition
- Sentiment analysis
- Topic modeling

For:
- UZH researchers
- External partners
- Commercial clients

Language Acquisition and Variation

AcqVA Aurora Lab

AcqVA Aurora Lab at UiT The Arctic University of Norway

Research Areas:
- Language acquisition
- Language variation
- Language attrition
- Psycholinguistics

Support:
- Methodological consultation
- Data collection facilities
- Analysis support

Digital Humanities Centers

Stanford Literary Lab

Stanford Literary Lab
- Computational literary studies
- Quantitative methods
- Research publications
- Innovative approaches

Nebraska Literary Lab

Nebraska Literary Lab
- Digital humanities research
- Text analysis projects
- Teaching resources

King’s Digital Lab

King’s Digital Lab
- Software development
- Digital humanities projects
- Research infrastructure
- Training and support

Media and Communication Research

Media Research Methods Lab

Media Research Methods Lab at the Leibniz Institute for Media Research

Focus:
- Computational social science
- Social media analysis
- Automated content analysis
- Network analysis

Methods:
- Survey research
- Experiments
- Content analysis
- Log data analysis
- Experience sampling

National Centre for Text Mining (NaCTeM)

NaCTeM - UK’s first publicly-funded text mining centre

Services:
- Text mining tools
- Literature-based discovery
- Biological database curation
- Clinical informatics

Resources:
- Software tools
- Training materials
- Publications


Corpora and Datasets

Major English Corpora

British National Corpus (BNC)

BNC
- 100 million words
- 1980s-1990s British English
- Written and spoken
- POS tagged
- Gold standard reference corpus

BNC2014
- 100 million words
- 2010s British English
- Conversation focus
- Comparable to original BNC

Corpus of Contemporary American English (COCA)

COCA
- 1 billion words (1990-2019)
- Balanced: spoken, fiction, popular magazines, newspapers, academic
- Updated regularly
- Genre and time filtering
- Free access

International Corpus of English (ICE)

ICE
- Comparable corpora from 20+ countries and regions
- 1 million words per variety
- Spoken and written
- Comparable structure
- Grammatically annotated

ICE Components Include:
- ICE-GB (Britain)
- ICE-USA (United States)
- ICE-Ireland
- ICE-Canada
- ICE-India
- ICE-Hong Kong
- And many more

Google Books Ngram Corpus

Google Books Ngram Viewer
- Trillions of words
- 1500-2019
- Multiple languages
- Phrase frequency over time
- Great for diachronic studies

Dataset Available:
- Download for offline analysis
- Excellent for language change research
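
From R, the third-party ngramr package is one way to query the Ngram Viewer programmatically. A minimal sketch (an assumption: both the package and the underlying web service may change):

```r
# Querying the Google Books Ngram Viewer from R
# install.packages("ngramr")
library(ngramr)

ng <- ngram(c("telephone", "television"), year_start = 1900)
head(ng)   # one row per phrase per year, with relative frequency
```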

Historical Corpora

Corpus of Historical American English (COHA)

COHA
- 400 million words
- 1820s-2000s
- Balanced by decade
- Fiction, magazines, newspapers, non-fiction
- Track language change

Early English Books Online (EEBO)

EEBO
- 1473-1700
- 25,000+ texts
- English printing beginnings
- Critical for historical linguistics

Corpus of English Dialogues (CED)

Compiled at Uppsala University and Lancaster University
- 1560-1760
- Trial proceedings
- Drama
- Didactic works
- Real/simulated conversation

Specialized Corpora

MICASE (Michigan Corpus of Academic Spoken English)

MICASE
- 1.8 million words
- Academic speech
- Lectures, seminars, study groups
- Searchable online

VOICE (Vienna-Oxford International Corpus of English)

VOICE
- 1 million words
- English as a Lingua Franca
- Speakers from 50+ first languages
- Face-to-face interaction

CHILDES (Child Language Data Exchange System)

CHILDES
- Child language acquisition data
- 50+ languages
- Longitudinal corpora
- Transcription standards
- Analysis tools (CLAN)

Learner Corpora

ICLE (International Corpus of Learner English)
- University-level learners
- Multiple L1 backgrounds
- Academic essays

EF-Cambridge Open Language Database
- EFCAMDAT
- Large-scale learner corpus
- 70 million words
- 150+ countries

Multilingual Resources

Universal Dependencies

Universal Dependencies
- 100+ languages
- Syntactically annotated treebanks
- Cross-linguistically consistent
- Open source
- Regular releases
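
In R, the udpipe package is one convenient route to UD annotations. A minimal sketch (the English model is downloaded automatically on first use):

```r
# Universal Dependencies parsing in R with udpipe
library(udpipe)

ud <- udpipe("The cat sat on the mat.", "english")
ud[, c("token", "lemma", "upos", "dep_rel")]   # UD tags and relations
```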

Leipzig Corpora Collection

corpora.uni-leipzig.de
- 136 languages
- Web-crawled texts
- Various sizes
- Freely downloadable

Parallel Corpora

OPUS (Open Parallel Corpus)
- opus.nlpl.eu
- Translation memory corpora
- 90+ languages
- Movie subtitles (OpenSubtitles)
- Bible translations
- EU documents
- Free download

Europarl
- European Parliament proceedings
- 21 languages
- Sentence-aligned
- Large-scale

Social Media and Web Corpora

Twitter Datasets

Availability:
- X API (formerly Twitter API), now largely paid
- Academic research access (free track discontinued in 2023)
- Historical archives

Tools:
- Twarc: Twitter archiving
- Tweepy: Python library
- rtweet: R package
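
A minimal rtweet sketch, with the caveat that free access has been heavily restricted under the X API, so this assumes you hold valid credentials:

```r
# Collecting recent posts with rtweet (credentials required)
library(rtweet)

auth_as(rtweet_app())   # interactive setup with your app's bearer token
tweets <- search_tweets("#CorpusLinguistics", n = 100)
head(tweets$text)
```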

Reddit Datasets

Pushshift Reddit Dataset
- Comments and submissions
- 2005-present
- Searchable
- JSON format

Common Crawl

Common Crawl
- Petabytes of web data
- Monthly snapshots since 2008
- Free access
- Requires processing


Blogs and Communities

Corpus Linguistics Blogs

Around the Word

Around the Word, hosted on Hypotheses.org, is Guillaume Desagulier’s corpus linguistics notebook.

Topics:
- Usage-based corpus linguistics
- R for corpus analysis
- Cognitive linguistics
- Construction grammar
- Practical methods

Features:
- Code examples
- Methodological reflections
- Experimental approaches

Linguistics with a Corpus

Companion blog to *Doing Linguistics with a Corpus* by Egbert, Larsson & Biber (2020)

Topics:
- Corpus methodology
- Research design
- Statistical issues
- Best practices

Community:
- Open discussion
- Reader questions
- Methodological debates

Corpora List

Corpora Mailing List
- Long-running discussion list
- Corpus linguistics community
- Job postings
- Conference announcements
- Tool discussions
- Corpus releases

Data Science and NLP Blogs

Aneesha Bakharia

Aneesha Bakharia on Medium

Topics:
- Data science
- Topic modeling
- Deep learning
- Learning analytics
- Practical tutorials

Digital Observatory Blog

Digital Observatory at Queensland University of Technology

Content:
- Social media data collection
- Workshops and training
- Tool updates
- Open office hours
- Research methods

Probably Overthinking It

Allen Downey’s Blog
- Bayesian statistics
- Data science
- Clear explanations
- Python examples

Simply Statistics

simplystatistics.org
- Statistics and data science
- By academic leaders (Jeff Leek, Roger Peng, Rafa Irizarry)
- Thoughtful commentary

Digital Humanities

Digital Humanities Now

digitalhumanitiesnow.org
- News aggregator
- DH community
- Project highlights
- Curated content

DHd Blog

DHd-Blog
- German DH community
- Multilingual posts
- Method discussions
- Project showcases


Awards and Showcases

Digital Humanities Awards

Digital Humanities Awards - Annual recognition of excellent DH work

Categories:
- Best Exploration of DH Failure/Limitations
- Best DH Data Visualization
- Best Use of DH for Fun
- Best DH Dataset
- Best DH Short Publication
- Best DH Tool or Suite of Tools
- Best DH Training Materials
- Special categories (vary yearly)

Why Follow:
- Discover cutting-edge projects
- Find high-quality resources
- See what the community values
- Get inspired

Platforms for Showcasing Work

Humanities Commons

hcommons.org
- Academic social network
- Share research
- Discover projects
- Collaborate

DHCommons

Project registry and showcase
- Find collaborators
- Share projects
- Get feedback


Additional Resources

Visualization Tools

R Visualization

ggplot2
- ggplot2.tidyverse.org
- Grammar of Graphics
- Publication-quality plots
- Highly customizable
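
A minimal sketch of the grammar-of-graphics idea, using made-up frequency counts:

```r
# A quick frequency bar chart with ggplot2
library(ggplot2)

freqs <- data.frame(word = c("corpus", "text", "data", "language"),
                    n    = c(42, 37, 29, 18))

ggplot(freqs, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +            # horizontal bars are easier to read
  labs(x = NULL, y = "Frequency")
```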

plotly
- plotly.com/r
- Interactive graphs
- Web-ready
- R and Python

Shiny
- shiny.rstudio.com
- Interactive web apps
- R-based
- No web dev needed

Python Visualization

matplotlib
- Foundation library
- Highly customizable
- Publication quality

seaborn
- Statistical visualization
- Beautiful defaults
- Built on matplotlib

altair
- Declarative visualization
- Interactive
- Grammar of graphics

Network Visualization

Gephi
- gephi.org
- Interactive network visualization
- Large graph support
- Open source

Cytoscape
- cytoscape.org
- Network analysis
- Biological focus
- Plugin ecosystem

Annotation Tools

WebAnno

webanno.github.io
- Web-based annotation
- Multiple users
- Various annotation types
- Inter-annotator agreement
- No longer actively developed (superseded by INCEpTION, below)

brat

brat.nlplab.org
- Browser-based annotation tool
- Linguistic annotation
- Entity relationships
- Coreference chains

Prodigy

prodi.gy
- Modern annotation tool
- Active learning
- Quick setup
- Commercial (with educational discount)

INCEpTION

inception-project.github.io
- Semantic annotation
- Knowledge base integration
- Machine-learning recommenders
- Open source

Data Management

Version Control

Git and GitHub
- github.com
- Code versioning
- Collaboration
- Open science

GitLab
- gitlab.com
- Alternative to GitHub
- Private repositories
- CI/CD integration

Research Data Management

OSF (Open Science Framework)
- osf.io
- Project management
- Preregistration
- DOI minting
- Version control

Zenodo
- zenodo.org
- Long-term data preservation
- DOI for datasets
- GitHub integration
- Free storage

Dataverse
- dataverse.org
- Data repository network
- Multiple institutions
- Data citation

Documentation

Read the Docs
- readthedocs.org
- Documentation hosting
- Sphinx integration
- Versioning

Jupyter Book
- jupyterbook.org
- Book from notebooks
- Beautiful online books
- Open source

Text Analysis Frameworks

Python

scikit-learn
- Machine learning
- Text classification
- Clustering
- Feature extraction

Gensim
- Topic modeling
- Word embeddings
- Document similarity
- Large corpus efficiency

TextBlob
- Simple NLP
- Sentiment analysis
- Translation
- Great for beginners

R Frameworks

quanteda
- Comprehensive text analysis
- Fast and efficient
- Well-documented
- Our recommended choice

tidytext
- Tidy data principles
- Integrates with dplyr/ggplot2
- Text mining
- Sentiment analysis

tm (text mining)
- Established package
- Document-term matrices
- Preprocessing
- Compatible with many tools

Machine Learning Platforms

Hugging Face

huggingface.co
- Transformers library
- Pre-trained models
- Datasets
- Model Hub
- Community-driven

Popular Models:
- BERT, GPT, RoBERTa
- T5, BART, XLM
- Language-specific models

TensorFlow and Keras

tensorflow.org
- Deep learning framework
- Keras high-level API
- Production deployment
- Large community

PyTorch

pytorch.org
- Deep learning framework
- Research-friendly
- Dynamic computation graphs
- Growing adoption


Getting Started Guide

For Complete Beginners

Your Learning Path

Week 1-2: Explore without installing
1. Try Voyant Tools
2. Search COCA
3. Watch AntConc tutorials

Week 3-4: Install first tools
1. Download AntConc
2. Follow a simple tutorial
3. Analyze your own texts

Month 2: Learn programming basics
1. Start Introduction to Cultural Analytics
2. OR Applied Language Technology
3. Practice with small projects

Month 3+: Specialize
1. Choose your focus (corpus linguistics, NLP, digital humanities)
2. Join relevant communities
3. Start a research project
4. Share your work

For Researchers

Setting Up Your Infrastructure:
1. Version control: Create GitHub account
2. Coding environment: Install R/Python/both
3. Editor: RStudio and/or Jupyter
4. Organization: Adopt project structure
5. Documentation: Start research notebook

Essential Skills to Develop:
- Regular expressions (see the sketch after this list)
- Data cleaning and preprocessing
- Statistical basics
- Visualization
- Reproducible research practices
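
A small base-R sketch of the first two skills, using invented strings:

```r
# Regular expressions and light text cleaning in base R
x <- c("  The YEAR was 1984.", "Contact: test@example.org ")

trimws(x)                     # strip leading/trailing whitespace
grepl("[0-9]{4}", x)          # which strings contain a 4-digit year?
sub("[0-9]{4}", "<YEAR>", x)  # mask the first year in each string
regmatches(x, regexpr("[[:alnum:]._]+@[[:alnum:].]+", x))  # extract emails
```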

Building Your Toolkit:
- Start with GUI tools (AntConc)
- Graduate to programming (R/Python)
- Master one language well
- Learn complementary tools as needed

For Teachers

Resources for Teaching:

Beginner-Friendly:
- Voyant Tools (no installation)
- AntConc (easy to use)
- Online concordancers
- GLAM Workbench (interactive notebooks)

Intermediate:
- Programming Historian lessons
- Cultural Analytics textbook
- Applied Language Technology course
- LADAL tutorials

Advanced:
- Research projects
- GitHub for collaboration
- Publication-ready analysis

Teaching Tips:
- Start with questions, not tools
- Use data students care about
- Scaffold from GUI to code
- Emphasize reproducibility
- Encourage sharing


Contributing to LADAL

Want to contribute a tutorial to LADAL? We welcome new authors!

Style Guide

Getting Started

Download Templates:
- Tutorial template
- Bibliography file

Required Elements:
1. Clear learning objectives
2. Installation/setup instructions
3. Worked examples with code
4. Exercises with solutions
5. Session info for reproducibility

Formatting Standards

Headers:
- Level 1: Numbered automatically
- Lower levels: Add {-} to suppress numbering

Code Style:
- Function and package names in backticks: `function()`
- Consistent spacing and indentation
- Comments explaining complex operations
- Use `library()`, not `require()`

Emphasis:
- Use *italics* for emphasis
- Bold sparingly
- `code` font for technical terms

Tables:
- Use flextable for display
- Clear captions
- Appropriate width settings

Exercises:
Use the standard exercise format:

***  
  
<div class="warning">  
  <p style='margin-top:1em; text-align:center'>  
    <b>EXERCISE TIME!</b>  
  </p>  
</div>  
  
<div class="question">  
  
1. Exercise question here  
  
<details>  
  <summary>Answer</summary>  
    
\`\`\`{r}  
# Solution code  
\`\`\`  
  
</details>  
  
</div>  
  
***  

Content Guidelines

Clarity:
- Explain concepts before technical details
- Use concrete examples
- Define jargon on first use
- Anticipate confusion

Pedagogy:
- Build progressively
- Provide context and motivation
- Multiple examples at different levels
- Practice opportunities throughout

Reproducibility:
- Complete code provided
- Package versions documented
- Data sources cited
- `sessionInfo()` at the end

Accessibility:
- Assume minimal prior knowledge
- Clear instructions
- Multiple learning paths
- Supportive tone

Citation Format

At the end of your tutorial:

Your last name, your first name. 2025. *The title of your tutorial*.   
Your location: your affiliation (if you have one).   
url: https://slcladal.github.io/shorttitleofyourtutorial.html   
(Version 2025.MM.DD).  

BibTeX entry:

@manual{yourlastname2025topic,  
  author = {YourLastName, YourFirstName},  
  title = {The Title of Your Tutorial},  
  note = {https://slcladal.github.io/shorttitleofyourtutorial.html},  
  year = {2025},  
  organization = {Your Affiliation},  
  address = {Your Location},  
  edition = {2025.MM.DD}  
}  

Session Info:

sessionInfo()  

Submission Process

Steps:
1. Develop your tutorial following the template
2. Test all code thoroughly
3. Proofread carefully
4. Email to LADAL team with:
- Your Rmd file
- Any required data files
- Brief description
- Your bio (optional)

We’ll:
1. Review for quality and fit
2. Suggest any revisions
3. Integrate into LADAL site
4. Credit you as author
5. Promote your tutorial

Questions?
Contact us through the LADAL website or email.


Stay Updated

Mailing Lists

Corpora List: Subscribe
- Corpus linguistics community
- Tool announcements
- Conference info
- Job postings

Humanist Discussion Group: Subscribe
- Digital humanities
- Long-running (since 1987!)
- Thoughtful discussions

Social Media

Twitter/X Hashtags:
- #CorpusLinguistics
- #DigitalHumanities
- #TextAnalytics
- #NLProc
- #Rstats (for R users)
- #Python (for Python users)

Follow:
- @corpusling
- @DHNow
- @rstudio
- @ProjectJupyter

LinkedIn Groups:
- Digital Humanities
- Text Analytics
- R Users

Conferences

Corpus Linguistics:
- ICAME (International Computer Archive of Modern and Medieval English)
- Corpus Linguistics (biennial)
- CLUK (Corpus Linguistics in the UK)

Digital Humanities:
- DH (Digital Humanities conference)
- TEI (Text Encoding Initiative)
- ADHO (Alliance of Digital Humanities Organizations)

NLP/Computational Linguistics:
- ACL (Association for Computational Linguistics)
- NAACL (North American Chapter of ACL)
- EMNLP (Empirical Methods in NLP)
- COLING (International Conference on Computational Linguistics)

Check:
- linguistic-conferences.org
- dh-abstracts.library.cmu.edu


Conclusion

This resource page brings together the best tools, tutorials, corpora, and communities for language technology and text analysis. Whether you’re just starting out or are an experienced researcher, these resources will support your work.

Next Steps
  1. Bookmark this page for easy reference
  2. Pick one resource that matches your current needs
  3. Join one community to connect with others
  4. Start a project applying what you learn
  5. Share your work and help others
  6. Contribute back by creating tutorials or tools

Remember:
- Start small and build gradually
- Focus on questions, not just tools
- Learn from the community
- Share your knowledge
- Reproducibility matters
- Have fun with your data!

The LADAL team is here to support you. Explore our tutorials, use our tools, and don’t hesitate to reach out with questions!



Back to top

Back to HOME


References

Anthony, L. (2004). AntConc: A learner and classroom friendly, multi-platform corpus analysis toolkit. IWLeL 2004: An Interactive Workshop on Language e-Learning, 7-13.

Baker, P. (2006). Using corpora in discourse analysis. Continuum.

Egbert, J., Larsson, T., & Biber, D. (2020). Doing linguistics with a corpus: Methodological considerations for the everyday user. Cambridge University Press.

McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge University Press.

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media.

Sinclair, J. (1991). Corpus, concordance, collocation. Oxford University Press.

Walsh, M. (2021). Introduction to cultural analytics & Python. https://melaniewalsh.github.io/Intro-Cultural-Analytics/

Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.


Last Updated: 2026-02-08

Maintained by: The LADAL Team

Contribute: Know a great resource we missed? Let us know!