I'm working in collaboration with a clinical diagnostics / research team in New Zealand, trying to use RNA biomarkers to determine disease risk. What we would ideally like to do is to generate expression ranges (in log space) using the reference samples, and then use those ranges to generate individual likelihoods per biomarker that are combined to generate a disease risk score. The original expectation was that this would be using the negative binomial distribution, but I can't find any documentation about how to estimate Nbinom parameters from a set of data (i.e. in a similar way that mean and sd can be used for normally-distributed data). Does anyone know of any ways to do this?
In light of this difficulty, I've been playing round with the 'boot' package to see if I can do bootstrap sub-sampling to estimate the expression distribution. One of our concerns is whether or not 6 samples is enough to be able to do this (that's what we've got at the moment, but we will have more in the future). Another concern is whether or not the approach I'm using is correct -- it's not something I've had much experience with previously, so I'm flailing around a bit. I'm currently subsampling reference samples with replacement, then calculating probability thresholds based on the distribution produced from the subsampling:
## extract out size-normalised counts for reference set ## [dgd.full is a DESeq2 dataset] normcounts.ref.mat <- counts(dgd.full, normalized=TRUE)[,dgd.full$reference == "Ref"]; ## subsample individuals (with replacement) to generate per-marker stats ref.boot <- boot(data=t(normcounts.ref.mat), statistic = function(data,indices){ colMeans(log1p(data[indices,])); }, R = 10000); ## extract out descriptive statistics ref.stats <- data.frame(row.names=rownames(normcounts.ref.mat)); for(stat in c("median","mean","sd","min","max")){ ref.stats[[sprintf("LogT.%s",stat)]] <- apply(ref.boot$t,2,match.fun(stat)); } for(prob in c(0.005,0.025,0.15,0.5,0.85,0.975,0.995)){ ref.stats[[sprintf("Q%s",prob)]] <- round(expm1(apply(ref.boot$t,2,quantile,probs=prob)),0); } ### this can be done with boot.ci as well: ## a <- boot.ci(ref.boot,index=1, type="perc", hinv=expm1, conf=c(0.7,0.95,0.99))
Does this method make sense? We're doing both subsampling and a DESeq2 DE analysis so that we can compare methods, and hopefully get an idea of which approach works best for our disease.
Cheers,
--
David Eccles
Bioinformatics Research Analyst, Gringene Bioinformatics
Room 2.10 x857
Malaghan Institute of Medical Research