Concepts in Reproducible Research

Author

Martin Schweinberger

Introduction

This tutorial introduces issues relating to reproducible research and shows how to make research and workflows more transparent and efficient. It is designed to provide essential guidance on transparency and reproducibility for individuals working with language data.

The objective of this tutorial is to discuss and differentiate key concepts in reproducible research and to outline simple strategies that improve the reproducibility of analytic workflows. By following this tutorial, users can begin to think about practices that ensure their workflows are transparent and their results reproducible, thereby increasing the reliability and credibility of their research.

This tutorial will also briefly introduce essential techniques such as version control with Git and effective documentation strategies. These methods are crucial for creating analyses that others can easily understand, verify, and build upon. By adopting these practices, you contribute to a more open and collaborative scientific community. For more practical tips on reproducible research practices using R and RStudio, see our Reproducibility with R tutorial.

Basic Concepts in Reproducibility

This section introduces some basic concepts and provides useful tips for rendering your research more transparent and reproducible.

Reproducibility is a cornerstone of scientific inquiry, demanding that two empirical analyses yield consistent outcomes under equivalent conditions and with comparable populations (Gundersen 2021; Goodman, Fanelli, and Ioannidis 2016). Historically, the reproducibility of scientific findings was often assumed, but this assumption has been substantially challenged by the Replication Crisis (Moonesinghe, Khoury, and Janssens 2007; Simons 2014). The Replication Crisis, which has been extensively documented (Open Science Collaboration 2015; Ioannidis 2005), represents an ongoing methodological quandary stemming from the failure to reproduce critical medical studies and seminal experiments in social psychology during the late 1990s and early 2000s. By the early 2010s, the crisis had extended to segments of the social and life sciences (Anderson et al. 2016; Diener and Biswas-Diener 2019), significantly eroding public trust in the results of studies from the humanities and social sciences (McRae 2018; Yong 2018).

Below are definitions of terms that help distinguish key concepts in discussions of reproducibility and transparency. This clarification is necessary to avoid misunderstandings stemming from the common conflation of similar but distinct terms.

Replication

Replication involves repeating a study’s procedure to determine whether the original findings can be confirmed. Unlike reproduction, which uses the same data and method, replication entails applying a similar method to different but comparable data. The aim is to ascertain whether the results of the original study are robust across new data from the same or similar populations. Replication serves to advance theory by subjecting existing understanding to new evidence (Nosek and Errington 2020; Moonesinghe, Khoury, and Janssens 2007).

Reproduction

In contrast, reproduction (or computational replication) entails repeating a study by applying the exact same method to the exact same data (this is what McEnery and Brezina (2022) refer to as repeatability). The results should ideally be identical or highly similar to those of the original study. Reproduction relies on reproducibility, which assumes that the original study’s authors have provided sufficient information for others to repeat the study. This concept often pertains to the computational aspects of research, such as version control of software tools and environments (Nosek and Errington 2020).

Robustness

Robustness refers to the stability of results when studies are replicated using different procedures on either the same or different yet similar data. While replication studies may yield different results from the original study due to the use of different data, robust results demonstrate consistency in the direction and size of effects across varying procedures (Nosek and Errington 2020).

Triangulation

Recognising the limitations of replication and reproducibility in addressing the issues highlighted by the Replication Crisis, researchers emphasise the importance of triangulation. Triangulation involves strategically employing multiple approaches to address a single research question, thereby enhancing the reliability and validity of findings (Munafò and Davey Smith 2018).

Practical versus theoretical/formal reproducibility

Schweinberger (2024) distinguishes between practical and theoretical (or formal) reproducibility. Practical reproducibility emphasises the provision of resources by authors that allow a study to be reproduced with minimal time and effort. These resources may include notebooks, code repositories, or detailed documentation, enabling studies to be reproduced across different computers and software environments (Grüning et al. 2018). Theoretical or formal reproducibility, by contrast, means that a study could in principle be reproduced from the information reported, even if doing so would require considerable time and effort.

Transparency

Transparency in research entails clear and comprehensive reporting of research procedures, methods, data, and analytical techniques. It involves providing sufficient information about study design, data collection, analysis, and interpretation to enable others to understand and potentially replicate the study. Transparency is particularly relevant in qualitative and interpretive research in the social sciences, where data sharing may be limited due to ethical or copyright considerations (Moravcsik 2019).

We now move on to some practical tips and tricks on how to implement transparent and well-documented research practices.

Documentation Guidelines

Documentation involves meticulously recording your work so that others—or yourself at a later date—can easily understand what you did and how you did it. This practice is crucial for maintaining clarity and continuity in your projects. As a general rule, you should document your work with the assumption that you are instructing someone else on how to navigate your files and processes on your computer.

  1. Be Clear and Concise: Write in a straightforward and concise manner. Avoid jargon and complex language to ensure that your documentation is accessible to a wide audience.

  2. Include Context: Provide background information to help the reader understand the purpose and scope of the work. Explain why certain decisions were made.

  3. Provide Step-by-Step Instructions: Break down processes into clear, sequential steps. This makes it easier for someone to follow your workflow.

  4. Use Consistent Formatting: Consistency in headings, fonts, and styles improves readability and helps readers quickly find the information they need.

  5. Document Locations and Structures: Clearly describe where files are located and the structure of your directories. Include details on how to navigate through your file system.

  6. Explain File Naming Conventions: Detail your file naming conventions so others can understand the logic behind your organisation and replicate it if necessary.

  7. Update Regularly: Documentation should be a living document. Regularly update it to reflect changes and new developments in your project.

Example

If you were documenting a data analysis project, your documentation might include:

  • Project Overview: A brief summary of the project’s objectives, scope, and outcomes.
  • Directory Structure: An explanation of the folder organisation and the purpose of each directory.
  • Data Sources: Descriptions of where data is stored and how it can be accessed.
  • Processing Steps: Detailed steps on how data is processed, including code snippets and explanations.
  • Analysis Methods: An overview of the analytical methods used and the rationale behind their selection.
  • Results: A summary of the results obtained and where they can be found.
  • Version Control: Information on how the project is version-controlled, including links to repositories and branches.

By following these best practices, your documentation will be comprehensive and user-friendly, ensuring that anyone who needs to understand your work can do so efficiently. This level of detail not only aids in collaboration but also enhances the reproducibility and transparency of your projects.
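As a minimal illustration of these practices, much of this documentation can live in the header of an analysis script. The sketch below is hypothetical: the project name, directory layout, and file paths are invented for illustration, and the here package (which appears in the session info at the end of this tutorial) is used to resolve paths relative to the project root.

# Project:  Corpus frequency analysis (hypothetical example project)
# Purpose:  Clean raw text files and compute word-frequency tables.
# Layout:   data/raw/    unmodified source files (never edited by hand)
#           data/clean/  processed files written by the cleaning script
#           scripts/     analysis code, run in numbered order
#           output/      tables and figures produced by the scripts
# Usage:    Open the project file and run scripts from the project root.

library(here)  # builds file paths relative to the project root

raw_dir   <- here("data", "raw")    # location of the raw input files
clean_dir <- here("data", "clean")  # destination for processed files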

Version control (Git)

Implementing version control systems, such as Git, helps track changes in code and data over time. The primary issue that version control applications address is the dependency of analyses on specific versions of software applications. What may have worked and produced a desired outcome with one version of a piece of software may no longer work with another version. Thus, keeping track of versions of software packages is crucial for sustainable reproducibility. Additionally, version control extends to tracking different versions of reports or analytic steps, particularly in collaborative settings (Blischak, Davenport, and Wilson 2016).
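For R users, one way to keep track of package versions is the renv package. The following is a minimal sketch of this approach, not a tool the tutorial prescribes; it assumes that renv is installed and that the code is run inside an R project.

# Minimal sketch: recording package versions with renv
# (assumes renv is installed and you are working inside an R project)
renv::init()      # create a project-local library and an renv.lock lockfile
# ... develop the analysis, installing packages as needed ...
renv::snapshot()  # record the exact package versions used in renv.lock
renv::restore()   # later, or on another machine, reinstall those versions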

Version control facilitates collaboration by allowing researchers to revert to previous versions if necessary and provides an audit trail of the data processing, analysis, and reporting steps. It enhances transparency by capturing the evolution of the research project. Version control systems, such as Git, can be utilised to track code changes and facilitate collaboration (Blischak, Davenport, and Wilson 2016).

RStudio has built-in version control and also allows direct connection of projects to GitHub repositories. GitHub is a web-based platform and service that provides a collaborative environment for software development projects. It offers version control using Git, a distributed version control system, allowing developers to track changes to their code, collaborate with others, and manage projects efficiently. GitHub also provides features such as issue tracking, code review, and project management tools, making it a popular choice for both individual developers and teams working on software projects.
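As a concrete illustration, an RStudio project can be put under Git version control and connected to GitHub directly from the R console. The sketch below uses the usethis package and assumes that Git is installed and that GitHub credentials have already been configured; it is one possible route, not the only one.

# Hedged sketch: initialising Git and connecting to GitHub with usethis
# (assumes Git is installed and GitHub credentials are configured)
usethis::use_git()     # initialise a local Git repository for the project
usethis::use_github()  # create a matching GitHub repository and push to it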

Uploading and sharing resources (such as notebooks, code, annotation schemes, additional reports, etc.) on repositories like GitHub (https://github.com/) (Beer 2018) supports long-term preservation and accessibility, so that the research remains available for future analysis and verification. By openly sharing research materials on platforms like GitHub, researchers enable others to access and scrutinise their work, thereby promoting transparency and reproducibility.

Digital Object Identifier (DOI) and Persistent Identifier (PID)

Once you’ve completed your project, help make your research data discoverable, accessible, and possibly reusable by assigning it a PID such as a DOI. A Digital Object Identifier (DOI) is a unique alphanumeric string, assigned by a publisher, organisation, or agency, that identifies content and provides a persistent link to its location on the internet, whether the object is digital or physical. It might look something like this: http://dx.doi.org/10.4225/01/4F8E15A1B4D89.

DOIs are a type of persistent identifier (PID). An identifier is any label used to name something uniquely (whether digital or physical). URLs are one example of an identifier; so are serial numbers and personal names. A persistent identifier is guaranteed to be managed and kept up to date over a defined time period.

Journal publishers assign DOIs to electronic copies of individual articles. DOIs can also be assigned by an organisation, research institute, or agency and are generally managed by the issuing organisation according to its policies. DOIs not only uniquely identify research data collections; they also support citation and citation metrics.

A DOI will also be given to any data set published in UQ eSpace, whether added manually or uploaded from UQ RDM. For information on how to cite data, consult your library’s data-citation guidance.

Key points

  • DOIs are persistent identifiers and as such carry expectations of curation, persistent access, and rich metadata

  • DOIs can be created for data sets and associated outputs (e.g., grey literature, workflows, algorithms, software); DOIs for data are equivalent to DOIs for other scholarly publications

  • DOIs enable accurate data citation and bibliometrics (both metrics and altmetrics)

  • Resolvable DOIs provide easy online access to research data for discovery, attribution and reuse (see the short sketch after this list)
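To see resolvability in action, the base-R sketch below follows a DOI link and checks that it resolves. The DOI used belongs to Nosek and Errington (2020), which is cited in this tutorial; no packages beyond base R are required.

# Minimal sketch: checking that a DOI resolves (base R only)
doi_url <- "https://doi.org/10.1371/journal.pbio.3000691"  # Nosek & Errington (2020)
headers <- curlGetHeaders(doi_url)  # follows redirects to the current landing page
attr(headers, "status")             # a status of 200 means the DOI resolved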

Going further

For Beginners

  • Ensure that data you associate with a publication has a DOI; your library is the best group to talk to about this.

For Intermediates

  • Learn more about how your DOI can potentially increase your citation rate by reading the ANDS Data Citation Guide


Citation & Session Info

Schweinberger, Martin. 2025. Concepts in Reproducible Research. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/repro.html (Version 2025.08.01).

@manual{schweinberger2025repro,
  author = {Schweinberger, Martin},
  title = {Concepts in Reproducible Research},
  note = {tutorials/repro/repro.html},
  year = {2025},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2025.08.01}
}
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS 15.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gapminder_1.0.0  lubridate_1.9.4  forcats_1.0.0    stringr_1.5.1   
 [5] dplyr_1.1.4      purrr_1.0.2      readr_2.1.5      tibble_3.2.1    
 [9] ggplot2_3.5.1    tidyverse_2.0.0  tidyr_1.3.1      here_1.0.1      
[13] DT_0.33          kableExtra_1.4.0 knitr_1.49      

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      jsonlite_1.8.9    compiler_4.4.0    tidyselect_1.2.1 
 [5] xml2_1.3.6        systemfonts_1.2.1 scales_1.3.0      yaml_2.3.10      
 [9] fastmap_1.2.0     R6_2.5.1          generics_0.1.3    htmlwidgets_1.6.4
[13] munsell_0.5.1     rprojroot_2.0.4   svglite_2.1.3     tzdb_0.4.0       
[17] pillar_1.10.1     rlang_1.1.5       stringi_1.8.4     xfun_0.50        
[21] timechange_0.3.0  viridisLite_0.4.2 cli_3.6.3         withr_3.0.2      
[25] magrittr_2.0.3    digest_0.6.37     grid_4.4.0        rstudioapi_0.17.1
[29] hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.3   
[33] glue_1.8.0        codetools_0.2-20  colorspace_2.1-1  rmarkdown_2.29   
[37] tools_4.4.0       pkgconfig_2.0.3   htmltools_0.5.8.1



References

Anderson, C. J., S. Bahnik, M. Barnett-Cowan, F. A. Bosco, J. Chandler, C. R. Chartier, and N. Della Penna. 2016. “Response to Comment on ‘Estimating the Reproducibility of Psychological Science’.” Science 351 (6277): 1037. https://doi.org/10.1126/science.aad9163.
Beer, Brent. 2018. Introducing GitHub: A Non-Technical Guide. O’Reilly.
Blischak, John D, Emily R Davenport, and Greg Wilson. 2016. “A Quick Introduction to Version Control with Git and GitHub.” PLoS Computational Biology 12 (1): e1004668.
Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716. https://doi.org/10.1126/science.aac4716.
Diener, Edward, and Robert Biswas-Diener. 2019. “The Replication Crisis in Psychology.” https://nobaproject.com/modules/the-replication-crisis-in-psychology.
Goodman, S. N., D. Fanelli, and J. P. Ioannidis. 2016. “What Does Research Reproducibility Mean?” Science Translational Medicine 8 (341): 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027.
Grüning, Björn, John Chilton, Johannes Köster, Ryan Dale, Nicola Soranzo, Marius van den Beek, et al. 2018. “Practical Computational Reproducibility in the Life Sciences.” Cell Systems 6 (6): 631–35. https://doi.org/10.1016/j.cels.2018.03.014.
Gundersen, O. E. 2021. “The Fundamental Principles of Reproducibility.” Philosophical Transactions of the Royal Society A 379: 20200210. https://doi.org/10.1098/rsta.2020.0210.
Ioannidis, J. P. A. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8): e124. https://doi.org/10.1371/journal.pmed.0020124.
McEnery, Tony, and Vaclav Brezina. 2022. Fundamental Principles of Corpus Linguistics. Cambridge University Press.
McRae, Mike. 2018. “Science’s ’Replication Crisis’ Has Reached Even the Most Respectable Journals, Report Shows.” https://www.sciencealert.com/replication-results-reproducibility-crisis-science-nature-journals.
Moonesinghe, Ramal, Muin J. Khoury, and A. Cecile J. W. Janssens. 2007. “Most Published Research Findings Are False—but a Little Replication Goes a Long Way.” PLoS Medicine 4 (2): e28. https://doi.org/10.1371/journal.pmed.0040028.
Moravcsik, Andrew. 2019. “Transparency in Qualitative Research.” In SAGE Research Methods Foundations. https://doi.org/10.4135/9781526421036.
Munafò, Marcus R., and George Davey Smith. 2018. “Robust Research Needs Many Lines of Evidence.” Nature 553 (7689): 399–401. https://doi.org/10.1038/d41586-018-01023-3.
Nosek, Brian A., and Timothy M. Errington. 2020. “What Is Replication?” PLoS Biology 18 (3): e3000691. https://doi.org/10.1371/journal.pbio.3000691.
Schweinberger, Martin. 2024. “Implications of the Replication Crisis for Corpus Linguistics – Some Suggestions to Improve Reproducibility.” In Broadening Horizons: Data-Intensive Approaches to English, edited by Mikko Laitinen and Paula Rautionaho. Cambridge University Press.
Simons, D. J. 2014. “The Value of Direct Replication.” Perspectives on Psychological Science 9 (1): 76–80. https://doi.org/10.1177/1745691613514755.
Yong, Ed. 2018. “Psychology’s Replication Crisis Is Running Out of Excuses. Another Big Project Has Found That Only Half of Studies Can Be Repeated. And This Time, the Usual Explanations Fall Flat.” https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/.