Question

Validity of using rlog transformed counts (rld) to obtain a list of expressed genes

0

Entering edit mode

sjmonkley ▴ 40

@sjmonkley-8860

Last seen 17 months ago

Sweden

Dear Expert(s)

(Sorry if this information is provided elsewhere- i have looked high and low with no success)

I have a RNASeq data set of 22 samples that can be compared to one another in a number of different combinations. Some of the samples are essentially treatment subsets of others. I have used DESeq2 on the countdata ( along with associated colData) to generate both rld and dds (and res) objects. For some samples there are (biological) duplicates for others quadruplicates.

I can generate the DE results for various sample combinations using contrasts but what I also want is a list of genes that are 'expressed' in one set of samples but (for example) not another. I realise what is 'expressed' depends on how that is defined and what cutoff is used but my questions are

1) is it valid to use the rlog transformed matrix (assay(rld)) for this purpose? I am concerned that this data is subject to normalization based on the experimental design input and so would differ with different designs.

If no which package would be recommended to obtain such data from count matrix?

2) If yes to above: are there any error or p-values associated with this data? (There appears to be such data in the DESeqDataSet buty I cant work out how to get at it). The intention being I would be used as a cutoff for expression values (lfc) for which error is too high.

3) is it valid to use the mean of replicates to obtain a value for a particular cell type

4) I have tried to use the rlog function on the countdata matrix which the manual suggests you can but I get an error (Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘sizeFactors’ for signature ‘"data.frame"’).

Thanks

Sue

DESeq2 rlog • 3.1k views

ADD COMMENT • link updated 9.5 years ago by Michael Love 43k • written 9.5 years ago by sjmonkley ▴ 40

score 2 · Answer 1 · 2015-10-02

2

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

Expressed vs not expressed is a tough question, and can't be produced directly from just any dataset as you can imagine (I would want to be certain I have saturated w.r.t sequencing depth for one thing). Here are a few suggestions though:

I could imagine using the rlog or VST transformed data to detect on/off distributions for expression if I had many samples from many different tissues types. Note that, if you have 100s of samples, the VST will be much faster than rlog.

To get an idea for this, see the Gene Expression Barcode paper (esp Fig 1):

http://nar.oxfordjournals.org/content/39/suppl_1/D1011.full

Re: question (1), I would use blind=FALSE, so the dispersion estimates takes into account the differences between tissues. It is only the global pattern of dispersion~mean which is used in the transformation, not which samples belong to which group. See the vignette for more on this distinction. For blind=FALSE, you will need to provide a DESeqDataSet, not a matrix.

Re: (2), no, it is just a transformation (like log2 or sqrt), no testing or standard errors.

Re: (3) seems reasonable to me.

Re: (4) Try providing a DESeqDataSet instead of a matrix.

ADD COMMENT • link 9.5 years ago Michael Love 43k

0

Entering edit mode

Thanks for your very helpful reply. I have 22 samples containing 3-5 distinct cell types- I would like to use this approach to help tell me how distinct. While I agree this is nothing compared to the Barcode dataset I still feel that there is value in coming up with gene lists that are considered on in one cell type and off in another.This sort of thing can be and is being done with FPKM data and I would like to do the same without having to reprocess my data

I look forward to the Barcode data based on RNASeq data- that may help in deciding on/off thresholding.

One more question that has come up when I raised this issue with colleagues: given how the rlog transformation works is it reasonable (assuming all caveats discussed above) to have the same expressed/nonexpressed threshold for all genes?

If so what would that threshold be? Say for example if a threshold of 5 FPKM was considered reasonable is there a way of coming up with a number that would be equivalent to this in the rlog transformed data? Could I use log2(5) ?

ADD REPLY • link 9.5 years ago sjmonkley ▴ 40

0

Entering edit mode

No, the same threshold for all genes would not be a good idea. The rlog is similar to log2(K_ij/s_j) (see paper). But the count K_ij is proportional to gene length, sequence content of the gene etc, in experiment-specific and sample-specific ways (which we don't understand that well yet).

But at the least you would want to correct for gene length and GC content. See the cqn and EDASeq packages.

ADD REPLY • link 9.5 years ago Michael Love 43k