Dear Expert(s)
(Sorry if this information is provided elsewhere- i have looked high and low with no success)
I have a RNASeq data set of 22 samples that can be compared to one another in a number of different combinations. Some of the samples are essentially treatment subsets of others. I have used DESeq2 on the countdata ( along with associated colData) to generate both rld and dds (and res) objects. For some samples there are (biological) duplicates for others quadruplicates.
I can generate the DE results for various sample combinations using contrasts but what I also want is a list of genes that are 'expressed' in one set of samples but (for example) not another. I realise what is 'expressed' depends on how that is defined and what cutoff is used but my questions are
1) is it valid to use the rlog transformed matrix (assay(rld)) for this purpose? I am concerned that this data is subject to normalization based on the experimental design input and so would differ with different designs.
If no which package would be recommended to obtain such data from count matrix?
2) If yes to above: are there any error or p-values associated with this data? (There appears to be such data in the DESeqDataSet buty I cant work out how to get at it). The intention being I would be used as a cutoff for expression values (lfc) for which error is too high.
3) is it valid to use the mean of replicates to obtain a value for a particular cell type
4) I have tried to use the rlog function on the countdata matrix which the manual suggests you can but I get an error (Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘sizeFactors’ for signature ‘"data.frame"’).
Thanks
Sue
Thanks for your very helpful reply. I have 22 samples containing 3-5 distinct cell types- I would like to use this approach to help tell me how distinct. While I agree this is nothing compared to the Barcode dataset I still feel that there is value in coming up with gene lists that are considered on in one cell type and off in another.This sort of thing can be and is being done with FPKM data and I would like to do the same without having to reprocess my data
I look forward to the Barcode data based on RNASeq data- that may help in deciding on/off thresholding.
One more question that has come up when I raised this issue with colleagues: given how the rlog transformation works is it reasonable (assuming all caveats discussed above) to have the same expressed/nonexpressed threshold for all genes?
If so what would that threshold be? Say for example if a threshold of 5 FPKM was considered reasonable is there a way of coming up with a number that would be equivalent to this in the rlog transformed data? Could I use log2(5) ?
No, the same threshold for all genes would not be a good idea. The rlog is similar to log2(K_ij/s_j) (see paper). But the count K_ij is proportional to gene length, sequence content of the gene etc, in experiment-specific and sample-specific ways (which we don't understand that well yet).
But at the least you would want to correct for gene length and GC content. See the cqn and EDASeq packages.