We are developing a package with a test for the detection of differential distributions. It will offer generalized testing functions for any one-dimensional data observed under two conditions, as well as a specialized statistical test for single-cell RNA-seq data. Now that development of all features is finished, we are trying to submit it to Bioconductor. This is my first submission to Bioconductor, and I would greatly appreciate any help with our problem.
Problem
The package's data/ directory includes an empirical cumulative distribution function (ECDF) built from 1,000,000 values. At 4.8 MB, it takes up most of the space in the package and leads to NOTEs/ERRORs when running R CMD check ... / R CMD BiocCheck ..., which tell us the package is too big.
- Is there any tolerance in the package size restrictions on Bioconductor? (See below for the total size.)
- Is it justifiable to submit a data package containing this distribution to Bioconductor on the grounds that another Bioconductor package requires it, even though it contains no biological data set? Searching the existing Bioconductor packages, I could only find cases where example data sets (genomic/biological data only) were provided as separate packages.
Again, thanks for any advice you have to offer!
Details
A function included in the package determines p-values from the empirical cumulative distribution function of a distribution called the Brownian bridge. This function was calculated beforehand to high precision and is saved as the following object:
> empcdf.ref
Empirical CDF
Call: ecdf(value.integral)
x[1:1000000] = 0.0083841, 0.0088768, 0.0095009, ..., 2.7204, 3.012
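To make the lookup concrete, here is a minimal sketch of how a p-value could be read off an object of class ecdf. The reference sample is a small simulated stand-in, not our actual value.integral, and treating the p-value as the upper-tail probability 1 - F(x) is an assumption made for illustration. The names ref.values, ecdf.demo and p_from_ecdf are hypothetical.

```r
# Stand-in reference sample (assumption); the real object holds
# 1,000,000 precomputed values.
set.seed(1)
ref.values <- abs(rnorm(10000))

# ecdf() returns a step function: cdf(x) is the share of values <= x.
ecdf.demo <- ecdf(ref.values)

# Hypothetical helper: upper-tail p-value for an observed statistic.
p_from_ecdf <- function(stat, cdf) {
  1 - cdf(stat)
}

# Larger statistics give smaller p-values; values beyond the largest
# reference value give exactly 0.
p_from_ecdf(c(0.5, 2, 10), ecdf.demo)
```

Because a p-value from such a lookup can never be finer than 1/n, the number of reference values directly limits the resolution, which is why the stored object is so large.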
In the current state of our package, the function is stored as an .RData file in the data/ directory, using the following command:
> save(empcdf.ref, file="data/empcdf_ref.RData", compress=TRUE, compression_level=9)
To get the best compression, we also tried the following commands:
> tools::checkRdaFiles("empcdf_ref.RData")
                     size ASCII compress version
empcdf_ref.RData 12337274 FALSE     gzip       3
> tools::resaveRdaFiles("empcdf_ref.RData", compress = "auto")
> tools::checkRdaFiles("empcdf_ref.RData")
                    size ASCII compress version
empcdf_ref.RData 5009924 FALSE       xz       3
As a result, the function can finally be stored at a size of 4.8 MB:
$ du -h data/empcdf_ref.RData
4,8M data/empcdf_ref.RData
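One idea that might reduce the size further, sketched here on a small simulated stand-in rather than on our real data: an ecdf object is a closure that carries both the sorted x values and the matching step heights in its environment, so saving only the x values (via knots()) and rebuilding the function at load time stores less. This assumes the reference values contain no ties, since tie counts are not recoverable from the knots alone. The names ecdf.demo, ref.knots and ecdf.rebuilt are hypothetical, and the actual saving on the real file is untested.

```r
# Small stand-in for the reference ECDF (assumption; the real one
# has 1,000,000 values).
set.seed(1)
ecdf.demo <- ecdf(rnorm(20000))

# Sorted distinct x values; enough to rebuild the ECDF if there are no ties.
ref.knots <- knots(ecdf.demo)

full.file  <- tempfile(fileext = ".RData")
knots.file <- tempfile(fileext = ".RData")
save(ecdf.demo, file = full.file,  compress = "xz")
save(ref.knots, file = knots.file, compress = "xz")

# Compare on-disk sizes in bytes.
c(full = file.size(full.file), knots = file.size(knots.file))

# Rebuild the function and check that it evaluates identically.
ecdf.rebuilt <- ecdf(ref.knots)
q <- seq(-4, 4, by = 0.01)
all.equal(ecdf.demo(q), ecdf.rebuilt(q))
```

In a package, the rebuild step could run once when the function that needs the ECDF is first called, at the cost of sorting the values again.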
Running R CMD BiocCheck nevertheless reports this as an ERROR:
* Checking package size...
* ERROR: Package Source tarball exceeds Bioconductor size
requirement.
Package Size: 5.0324 MB
Size Requirement: 5.0000 MB
* Checking individual file sizes...
* WARNING: The following files are over 5MB in size:
'data/empcdf_ref.RData'
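The WARNING flags a file that du reports as 4,8M. A possible explanation, inferred only from the numbers in this post and not from the BiocCheck documentation, is that the two tools count megabytes differently: du -h uses binary units (1024^2 bytes), while a limit of 5 * 10^6 bytes would be exceeded. The arithmetic:

```r
# File size in bytes, as reported by tools::checkRdaFiles() after resaving.
size.bytes <- 5009924

size.mib <- size.bytes / 1024^2  # binary megabytes, the unit du -h uses
size.mb  <- size.bytes / 1e6     # decimal megabytes

# About 4.78 MiB versus about 5.01 MB.
round(c(MiB = size.mib, MB = size.mb), 2)

# Below 5 MiB, but above 5 * 10^6 bytes.
c(below.5MiB = size.mib < 5, above.5MB = size.mb > 5)
```

If that reading is right, the file exceeds the limit by roughly 10 kB, so even a small further reduction would clear the individual file check.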
EDIT: BiocCheck Tag added and R CMD BiocCheck output
You might get a better response posting this to the Bioc Developers mailing list, which is more focused on issues like this. You can sign up at https://stat.ethz.ch/mailman/listinfo/bioc-devel
There was a post very recently on this same topic which may be useful: https://stat.ethz.ch/pipermail/bioc-devel/2019-July/015311.html
Thank you, Mike! I will reframe this question and try the mailing list.