This page contains links to resources relevant for language technology, text analytics, data management and reproducibility, language data science and natural language processing.
LADAL is part of the Australian Text Analytics Platform (ATAP). The aim of ATAP is to provide researchers with a Notebook environment – in other words a tool set - that is more powerful and customisable than standard packages, while being accessible to a large number of researchers who do not have strong coding skills.
AntConc is a freeware corpus analysis toolkit for concordancing and text analysis developed by Laurence Anthony. In addition to AntConc, Laurence Anthony’s AntLab contains a conglomeration of extremely useful and very user-friendly software tools, that help and facilitate the analysis of textual data. Laurence has really developed an impressive, very user-friendly selection of tools that assist anyone interested in working with language data.
SMARTool is a corpus-based language learning and analysis tool for for English-speaking learners of Russian. It is linguist-built and thus informed by modern linguistic theory. SMARTool assists learners with coping with the rich Russian morphology and has user-friendly, corpus-based information and help for learning the most frequent forms of 3,000 basic vocabulary items.
Applied Language Technology is a website hosting learning materials for two courses taught at the University of Helsinki: Working with Text in Python and Natural Language Processing for Linguists. Together, these two courses provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.
Introduction to Cultural Analytics & Python is a website established by Melanie Walsh that hosts an online textbook which offers an introduction to the programming language Python that is specifically designed for people interested in the humanities and social sciences.
The GLAM Workbench is a collection of tools, tutorials, examples, and hacks to help you work with data from galleries, libraries, archives, and museums (the GLAM sector). The primary focus is Australia and New Zealand, but new collections are being added all the time.
The resources in the GLAM Workbench are created and shared as Jupyter notebooks. Jupyter lets you combine narrative text and live code in an environment that encourages you to learn and explore. Jupyter notebooks run in your browser, and you can get started without installing any software!
One really great advantage of the GLAM workbench is that it is interactive: if you click on one of the links that says Run live on Binder, this will open the notebook, ready to use, in a customised computing environment using the Binder service.
The Programming Historian is a collaborative blog that contains lessons on various topics associated with computational humanities. The blog was founded in 2008 by William J. Turkel and Alan MacEachern. It focused heavily on the Python programming language and was published open access as a Network in Canadian History & Environment (NiCHE) Digital Infrastructure project. In 2012, Programming Historian expanded its editorial team and launched as an open access peer reviewed scholarly journal of methodology for digital historians.
TAPoR 3, the Text Analysis Portal for Research contains a unique, well-curated list of a wide variety of tools commonly used or widely respected groups of tools used by leading scholars in the various fields of Digital Humanities. TAPoR 3 was developed with support from the Text Mining the Novel Project. The tools mentioned in the list provided by TAPoR 3 represent both tried and trusted tools used by DH scholars, or new advancements that offer exciting new possibilities in their field. Curated lists are excellent places to start exploring TAPoR from, and can help with deciding what to include in your own lists.
Quick-R by DataCamp and maintained by Rob Kabacoff that contains tutorial and R code on R, data management, statistics, and data visualization. The site is extremely helpful and recommendable for everyone that looks for code snippets that can be adapted for your own use.
Statistical Tools For High-Throughput Data Analysis (STHDA) is a website containing various tutorials on the implementation of statistical methods in R. It is particularly useful due to the wide range of topics and procedures it covers.
At the Text Crunching Centre, a team of experts in Natural Language Processing supports your text technology needs. The TCC is part of the Department of Computational Linguistics at the University of Zurich and it is a service offered to all departments of the University of Zurich as well as to external partners or customers.
VARIENG stands for the Research Unit for the Study of Variation, Contacts and Change in English. It also stands for innovative thinking and team work in English corpus linguistics and the study of language variation and change. VARIENG members study the English language, its uses and users, both today and in the past.
The AcqVA Aurora Lab, is the lab of the UiT Aurora Center for Language Acquisition, Variation & Attrition at The Arctic University of Norway in Tromsø. Together with the Psycholinguistics of Language Representation (PoLaR) lab, the AcqVA Lab provides methodological support for researchers at the AcqVA Aurora Center.
The Sydney Corpus Lab aims to promote corpus linguistics in Australia. It’s a virtual, rather than a physical lab, and is an online platform for connecting computer-based linguists across the University of Sydney and beyond. Its mission is to build research capacity in corpus linguistics at the University of Sydney, to connect Australian corpus linguists, and to promote the method in Australia, both in linguistics and in other disciplines. We have strong links with the Sydney Centre for Language Research (Computational Approaches to Language node) and the Sydney Digital Humanities Research Group.
The Media Research Methods Lab (MRML) at the Leibniz Institute for Media Research │ Hans Bredow Institute (HBI) is designed as a method-oriented lab, which focuses on linking established social science methods (surveys, observations, content analysis, experiments) with new digital methods from the field of computational social science (e.g. automated content analysis, network analysis, log data analysis, experience sampling) across topics and disciplines.
The National Centre for Text Mining (NaCTeM), located in the UK, is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. On the NaCTeM website, you can find pointers to sources of information about text mining such as links to text mining services provided by NaCTeM, software tools (both those developed by the NaCTeM team and by other text mining groups), seminars, general events, conferences and workshops, tutorials and demonstrations, and text mining publications.
Hypotheses is a research blog by [Guillaume Desagulier](http://www2.univ-paris8.fr/desagulier/home/. In this blog, entitled Around the word, A corpus linguist’s notebook, Guillaume has recorded reflections and experiments on his practice as a usage-based corpus linguist and code for analyzing language using R.
Aneesha Bakharia is a blog by Aneesha Bakharia about Data Science, Topic Modeling, Deep Learning, Algorithm Usability and Interpretation, Learning Analytics, and Electronics. Aneesha began her career as an electronics engineer but quickly transitioned to an educational software developer. Aneesha has worked in the higher education and vocational educational sectors in a variety of technical, innovation and project management roles. In her most recent role before commencing at UQ, Aneesha was a project manager for a large OLT Teaching and Learning grant on Learning Analytics at QUT. Aneesha’s primary responsibilities at ITaLI include directing the design, development and implementation of learning analytics initiatives (including the Course Insights teacher facing dashboard) at UQ.
Linguistics with a Corpus is a companion blog to the Cambridge Element book Doing Linguistics with a Corpus. Methodological Considerations for the Everyday User (Doing Linguistics with a Corpus. Methodological Considerations for the Everyday User, n.d.). The blog is intended to be a forum for methodological discussion in corpus linguistics, and Jesse, Tove, and Doug very much welcome comments and thoughts by visitors and readers!
The blog of the Digital Observatory at the Queensland University of Technology (QUT) contains updates on resources, meetings, (open) office hours, workshops, and developments at the Digital Observatory which provides services to researchers such as retrieving and processing social media data or offering workshops on data processing, data visualization, and anything related to gathering and working with social media data.
BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:
BookNLP ships with two models, both with identical architectures but different underlying BERT sizes. The larger and more accurate big model is fit for GPUs and multi-core computers; the faster small model is more appropriate for personal computers.
The Digital Humanities Awards page contains a list of great DH resources for the following categories:
This blog entry introduces the Data Measurements Tool. The Data Measurements Tool is a Interactive Tool for Looking at Datasets. The Data Measurements Tool is a open-source Python library and no-code interface called the Data Measurements Tool, using our Dataset and Spaces Hubs paired with the great Streamlit tool. This can be used to help understand, build, curate, and compare datasets.
Doing Linguistics with a Corpus. Methodological Considerations for the Everyday User. n.d. cambridge: Cambridge University Press.