Visruth Srimath Kandali

R, Statistics, Reproducibility


Mostly just penning some ideas that have been floating around in my head. I can’t think of a semi-synonym for stats that starts with R to complete the trilateral alliteration, which was saddening :(


Reproducibility seems to become more of a hot topic with every passing year (as I’ve ascertained through literature review, not experience), but I still feel like the discourse isn’t radical enough, so to speak, especially in the Stats community. I didn’t learn about {renv} in class as it was deemed too advanced, and obviously pursuits into nix and NixOS weren’t endorsed by schooling at all. I don’t think nix is that hard, especially not if one had arranged materials and a guided lesson to go through instead of muddling through the wiki, dotfiles, and various guides. I’d argue that using {renv} is trivial: students don’t need to understand much beyond the workflow, as we don’t really know how the tools we use work anyway [1]. Students can use these tools and get used to the workflow without a deep understanding, or really any mental model, of what is going on.

I use git all the time and I couldn’t explain how it truly works; I just know the basics of how to use git, and this suffices for my applications (replace with your favourite VCS; jj, anyone?). I think you cannot truly know a thing until you can explain how it works basically from the ground up, but I also think that most people don’t need to know things in that sense. I don’t know R, I know how to use R. The distinction gets dropped in casual conversation, but I’m bringing it up here because I don’t think we (as a field, Statistics) necessarily need to understand reproducibility before blindly accepting and adhering to its tenets.

I think the advantages and necessity of reproducible environments/workflows/whathaveyou are clear, but the tooling around how to create these needs work. I am a proponent (surprise) of nix, which has guix as a (more friendly?) alternative, but I am painfully aware of how difficult it can be to get a project up and running using nix (on NixOS). In a stats context specifically, {renv} at least pins package versions and records the R version in use to aid the pursuit of reproducibility, and though it isn’t nearly enough, it is far better than nothing. Using {renv} is far easier than nix and friends; a few lines get you up and running, and it slots neatly into your existing workflow without much issue. Pinning dependencies like this is already common in Python projects, and is made even better with uv, which makes working with Python environments tenable. Statistics in R needs to adopt a similar posture towards at least pinning dependencies, by way of {renv} or any competitor which does the job better.
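For a sense of how few lines are involved, here is a minimal sketch of the {renv} workflow. It assumes {renv} is installed from CRAN and that the commands are run from the root of an existing project; the package installed in the middle (dplyr) is only a placeholder for whatever the analysis actually needs.

```r
# Assumption: run from the project root, with CRAN reachable.
install.packages("renv")

# Set up a project-local library and write an initial lockfile (renv.lock).
renv::init()

# Work as usual; dplyr is just a placeholder dependency for illustration.
install.packages("dplyr")

# Record the package versions currently in use into renv.lock.
renv::snapshot()

# Later, or on another machine: reinstall the versions recorded in renv.lock.
renv::restore()
```

Committing renv.lock alongside the analysis is what lets someone else (or future you) call renv::restore() and get the same package versions back. The R version is only recorded in the lockfile, not installed, so that part still has to be handled some other way.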

I think the real goal of reproducibility should look far further than passing a reproducible environment to another person on a different platform or whatever in the current year; the goal should be to have analyses which can be re-run many years, ideally decades, into the future. That goal is rather lofty, if not unattainable, but it is something we need to keep in mind and be working toward. See guix and SQLite for some brighter people talking about similar things.


  1. Ask any statistics student to explain NSE, or whether they even know what NSE means (non-standard evaluation; see Advanced R).
