Extending the loo Package
I’m excited to be accepted into the GSoC 2025 cohort, working on the loo R package! I’ll be mentored by Aki Vehtari, Jonah Gabry, and Noa Kallioinen. Apparently only 1,272 of 15,240 applicants (who submitted 23,559 proposals) were accepted, an acceptance rate of ~8.35% per applicant (~5.4% per proposal). This was the only application I submitted.
The loo package is widely used to cross-validate Bayesian models and has over three million downloads and several thousand citations to date. The focus of this project is to extend the package’s API to support a wider array of predictive measures.
Gory technical details, extracted from our project proposal, follow. Expect more posts once the project starts in earnest.
PS: Getting a working bibliography and citations was annoying because the projects I found were old and didn’t work. I vibe coded (true vibe coding: pure verification, no latent reasoning) my way from one of those to these shortcodes. Specify a bibliography (stored at data/$NAME.json) by adding bib: $NAME to your post’s YAML.
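For instance, a post’s front matter might look like this (the bibliography name here is just an illustration):

```yaml
title: "Extending the loo Package"
bib: loo-gsoc   # loads the bibliography stored at data/loo-gsoc.json
```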
Technical Details
This project is focused on an overhaul of the existing loo API to admit new CV methods, metrics, and scores. There are a number of pertinent issues, viz. #281, #223, #220, #213, #201, #135, and #106.
First, a brief overview of LOO and of the goals and procedures of the loo package.
Cross-validation (CV) is a common means to estimate a model’s predictive accuracy, e.g. for model selection or stacking (Vehtari et al., 2017). Leave-one-out CV (LOO) is a CV structure where the model is refit with one data point left out, for every point in the data. This process is computationally expensive, though, so approximate LOO can be used instead, computed simply through importance sampling (IS) (Vehtari et al., 2017). However, the importance weights used for IS can have very large or infinite variance (Vehtari et al., 2017), so one can use Pareto smoothed importance sampling (PSIS) (Vehtari et al., 2024) to enjoy more stable LOO estimates and, additionally, to have a simple diagnostic for determining whether the PSIS estimate is likely to have a small error.
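As a concrete illustration, here is a minimal sketch of computing PSIS-LOO with the current loo API, using simulated log-likelihoods in place of a real model fit:

```r
library(loo)

# S-by-N matrix of pointwise log-likelihoods (S posterior draws,
# N observations), e.g. extracted from a Stan fit; simulated here.
set.seed(1)
S <- 1000; N <- 50
log_lik <- matrix(rnorm(S * N, mean = -1), nrow = S, ncol = N)

# Relative effective sample sizes account for MCMC autocorrelation;
# chain_id maps each draw to its chain (4 chains of 250 draws here).
r_eff <- relative_eff(exp(log_lik), chain_id = rep(1:4, each = S / 4))

# PSIS-LOO estimate of the expected log pointwise predictive density.
fit_loo <- loo(log_lik, r_eff = r_eff)
print(fit_loo)

# Pareto k diagnostics: small k values indicate the PSIS estimate is
# likely to be reliable for that observation.
print(pareto_k_table(fit_loo))
```

The log score (elpd) is the only measure this interface currently reports; the project adds the rest.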
Currently, loo returns an object that we will extend to work with more measures, adding metrics such as MAE, RMSE, MSE, R², ACC, balanced ACC, and the Brier score, and scores such as RPS, SRPS, CRPS, SCRPS, and the log score. We will need to create a flexible object which can report multiple metrics and scores. Additionally, we need to create functions to support model comparisons for all of these scores and metrics.
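A rough sketch of what such an interface could look like (every name below is an illustrative placeholder, not the final API):

```r
# Hypothetical extended interface: request several measures at once and
# get back a single loo-style object holding pointwise values and
# summary estimates for each measure.
res <- loo_measures(          # placeholder name, not an existing function
  fit,                        # fitted model (or log-lik and predictive draws)
  measures = c("elpd", "crps", "rmse"),
  y = y_obs                   # observed outcomes, required for most metrics
)

res$estimates  # one row per measure: Estimate and SE
res$pointwise  # N rows, one column of pointwise values per measure
```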
For all measures, we will return a loo object with pointwise measures and estimates; the former allow us to quantify the uncertainty in model comparisons.
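This is the same mechanism loo already uses for the log score: the standard error of a difference between two models is computed from the pointwise differences (a sketch, assuming two fitted loo objects fit_a and fit_b on the same data):

```r
# Pointwise elpd differences between the two models.
diff_pw <- fit_a$pointwise[, "elpd_loo"] - fit_b$pointwise[, "elpd_loo"]

# SE of the total difference, as in Vehtari et al. (2017):
# sqrt(N) times the standard deviation of the pointwise differences.
se_diff <- sqrt(length(diff_pw)) * sd(diff_pw)
```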
We will also allow measures besides the log score to be used for LOO-CV under the same, unified interface as existing options, and create a consistent loo object for all measures. We will further unify the interface by allowing non-log-score measures for the in-sample, test-data, and K-fold-CV use cases. Measures will be stratified into scores and metrics, since scores need draws from the predictive distribution while metrics need predictions from a point estimate or the posterior. The interface will be cleanly split for these two broad cases; the sketch below illustrates the distinction.
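To make the score/metric split concrete, here is a minimal sketch in plain R (assumed shapes, not the package’s implementation): a score such as CRPS consumes predictive draws, while a metric such as RMSE consumes point predictions.

```r
# CRPS (a score): needs an S-by-N matrix of posterior predictive draws.
# Empirical CRPS per observation: E|Y - y| - 0.5 * E|Y - Y'|.
crps_pointwise <- function(y_rep, y) {
  sapply(seq_along(y), function(i) {
    draws <- y_rep[, i]
    mean(abs(draws - y[i])) - 0.5 * mean(abs(outer(draws, draws, "-")))
  })
}

# RMSE (a metric): needs only one point prediction per observation,
# e.g. the posterior predictive mean colMeans(y_rep).
rmse <- function(y_pred, y) sqrt(mean((y_pred - y)^2))
```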
Additionally, we will spin out the PSIS functions to differentiate when PSIS is being used, as opposed to measures computed on in-sample data, independent test data, or K-fold-CV. We will also extend the model comparison functions to carry forward information on which measure is being compared, diagnostic data, and information on how to calculate the standard error (SE) of the differences of the various measures.
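Today, loo_compare() reports only elpd differences; under the planned extension a comparison might carry its measure and diagnostics along with it, roughly like this (the attribute names are illustrative, not a committed design):

```r
# Current behaviour: compare two loo objects on elpd.
comp <- loo_compare(fit_a, fit_b)
print(comp)  # columns: elpd_diff, se_diff

# Sketch of the extended comparison object (illustrative, not final):
# attr(comp, "measure")     #> "crps"       which measure was compared
# attr(comp, "se_method")   #> "pointwise"  how se_diff was computed
# attr(comp, "diagnostics")                 e.g. Pareto k values carried over
```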
Bibliography
Vehtari, Aki et al. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432. doi: 10.1007/s11222-016-9696-4.
Vehtari, Aki et al. (2024). Pareto smoothed importance sampling. Journal of Machine Learning Research, 25(72), 1–58. http://jmlr.org/papers/v25/19-556.html.