Package archive for the purpose of reproducible research
@robertbruccoleri-9367
United States

I am helping a group of statistical and genetic analysts maintain an R environment for their work. One key capability that is required is the ability to reproduce previous computations. The R packages that were used for an analysis are typically recorded.

For CRAN, it's easy to find the tar-ball for any published version of any package, but that is not the case for Bioconductor. The directories that contain tar-balls for versions between releases are not indexed, so the only way to find a specific version is to guess its URL and try it, which is inefficient and impractical.

Would the Bioconductor developers please consider making all published tar-balls available in some easy-to-find way? I understand the need for having a consistent set of packages so incompatibilities are avoided, but the current situation makes it very difficult for those of us facing an auditor who wants us to reproduce a five-year-old computation.

One suggestion would be to put the tarball archive behind a user account system (as is used by this forum for posting messages) and show the user a clear statement that tarballs outside of releases are not supported or recommended in any way.

Thanks.

archive reproducibility • 1.6k views
@martin-morgan-1513
United States

This is a relatively regular request (most recently on Bioc-devel), with different possible solutions. Tar balls represent one solution. Another is to tag package versions in source code repositories. There are several additional schemes to save R archives, including packrat and switchr. This will eventually be addressed, but is not at the top of our priority list.

Generally, the 'final' version of a package for a Bioconductor release is available at a URL with path /packages/<bioc-version>/<package-name>, e.g., https://bioconductor.org/packages/2.14/IRanges. Each release is a CRAN-style repository, so the contents of the release can be discovered at http://bioconductor.org/packages/2.14/bioc/src/contrib/PACKAGES, e.g., using read.dcf(url()). Using biocLite(), as documented on each package landing page, with the appropriate version of R (and BiocInstaller, for more recent releases) consults this repository.
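As a rough sketch of the read.dcf(url()) approach, the following builds the PACKAGES URL for a release and extracts package/version pairs. The helper names packages_index_url() and read_release_versions() are made up for illustration, and the assumption that every release follows the same path layout as the 2.14 example is mine, not something guaranteed here.

```r
# Hypothetical helper: build the PACKAGES index URL for a release,
# assuming the same path layout as the 2.14 example above.
packages_index_url <- function(bioc_version, repo = "bioc") {
  sprintf("http://bioconductor.org/packages/%s/%s/src/contrib/PACKAGES",
          bioc_version, repo)
}

# Hypothetical helper: read a PACKAGES index (DCF format) from a
# connection and return a data frame of Package / Version pairs.
read_release_versions <- function(con) {
  idx <- read.dcf(con, fields = c("Package", "Version"))
  as.data.frame(idx, stringsAsFactors = FALSE)
}

# Usage (needs network access, so left commented out):
# con <- url(packages_index_url("2.14"))
# versions <- read_release_versions(con)
# close(con)
# versions[versions$Package == "IRanges", ]
```

This only reports the final version of each package within a release; it does not locate intermediate versions.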

The tar.gz archive is unlikely to be sufficient for reproducibility five years from now -- compilers and operating system dependencies will have changed by then. Reproducing a result therefore requires that the user has also maintained this part of the infrastructure, so reproducibility is generally in the hands of the user and their specific systems -- having tarballs available will be of little value.

Exact reproducibility is a starting point for validating an analysis, but the scientific case for reproducing an incorrect analysis [reflected in .z versions other than the final x.y.z within a Bioconductor release] is not particularly compelling.

I opened a GitHub issue for this on our 'BBS' (Bioconductor build system) software; you and others are welcome to augment the issue with specific suggestions (and of course 'patches welcome', though that would be challenging in the present case).


The value of having specific versions available (as reported by sessionInfo) should not be discounted.

In the face of a data quality audit where a prior result is failing to reproduce, being able to eliminate version x.y.z differences as a source of the failure to reproduce would be very compelling and valuable!
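To illustrate the kind of check an audit might start with, here is a minimal sketch that compares package versions recorded from sessionInfo with the final versions in a release. The function name compare_to_release() is hypothetical, and it assumes the release versions are already available as a data frame with Package and Version columns (for instance, parsed from a release's PACKAGES file).

```r
# Hypothetical helper. Assumed inputs:
#   recorded: named character vector of versions noted at analysis time,
#             e.g. c(IRanges = "1.22.9")
#   release:  data frame with character columns Package and Version
compare_to_release <- function(recorded, release) {
  # Look up each recorded package in the release; NA if absent.
  rel <- release$Version[match(names(recorded), release$Package)]
  status <- ifelse(is.na(rel), "not in release",
                   ifelse(rel == recorded, "same", "differs"))
  data.frame(Package = names(recorded),
             Recorded = unname(recorded),
             Release = rel,
             Status = status,
             stringsAsFactors = FALSE)
}
```

Packages flagged "differs" are the ones where an intermediate x.y.z tar-ball would be needed to rule out a version difference.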

Going forward, a simple solution would be to just save the tar-balls in a CRAN-like archive.
