Dear All,
I would like to ask a specific question about data transformation and the appropriate
comparison of sample groups in gene expression data. Starting from raw RNA-Seq gene counts,
I applied the VST transformation from the DESeq2 R package before running several clustering
methods, and I ended up with a specific group of genes whose expression patterns separate my
samples into groups that match the studied phenotype, based on heatmap plots.
My next goal is to draw complementary pairwise boxplots for some predefined cluster groups,
based on a subset of these genes, in order to provide extra evidence of a significant
difference in the relative expression of each gene between these groups.
Thus, my questions are the following:
1) Are VST-transformed RNA-Seq counts appropriate for drawing such boxplots, in order to
compare the group means for each selected gene? Moreover, for adding p-values and
significance levels, would a simple test be fine in this case, such as a t-test, or an
ANOVA for more than 2 groups?
2) Or are VST-transformed counts not appropriate for comparing means, even for a very
small number of genes, so that I should use a different transformation?
For example, use my matrix object of counts:
xx <- estimateSizeFactorsForMatrix(counts=matrix.count)
and afterwards use the function:
xx2 <- normTransform(xx, f = log2, pc = 1) ?
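A note on the two calls above: estimateSizeFactorsForMatrix() returns a plain numeric vector of per-sample size factors, whereas normTransform() expects a DESeqDataSet, so the calls cannot be chained as written. Below is a minimal sketch of two equivalent routes. The simulated matrix stands in for matrix.count, and the names sf, manual.log2, coldata, dds and ntd are illustrative assumptions, not part of the original post.

```r
library(DESeq2)

# Simulated counts standing in for matrix.count (hypothetical data)
set.seed(1)
matrix.count <- matrix(rnbinom(200 * 8, mu = 100, size = 5), nrow = 200,
                       dimnames = list(paste0("gene", 1:200), paste0("s", 1:8)))

# Route 1: size factors from the bare matrix, then log2(normalized count + 1) by hand
sf <- estimateSizeFactorsForMatrix(matrix.count)
manual.log2 <- log2(sweep(matrix.count, 2, sf, "/") + 1)

# Route 2: wrap the counts in a DESeqDataSet and let normTransform() do both steps
coldata <- data.frame(sample = colnames(matrix.count),
                      row.names = colnames(matrix.count))
dds <- DESeqDataSetFromMatrix(matrix.count, colData = coldata, design = ~ 1)
ntd <- normTransform(dds, f = log2, pc = 1)

# Check whether the two routes agree
same <- all.equal(manual.log2, assay(ntd), check.attributes = FALSE)
```

Route 2 keeps the sample annotation attached to the transformed values, which tends to be more convenient for later subsetting and plotting.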
Dear Michael,
Thank you very much for your answer! Please excuse me for any confusion and for not providing enough information about my goals. Briefly, these 20 genes were identified and selected from another dataset with the same phenotype, based on DE analysis and feature selection. On this premise, we wanted to test, on a different transcriptomic dataset of the same disease, whether these genes have any interesting biological implications for clustering patients and for relative survival, and this has shown good results with VST and the downstream analysis. That is the reason for my original post: my collaborators asked whether it is possible to provide some extra figures, such as boxplots of the resulting clusters and, if possible, p-values, in order to further support the interesting patterns that many of these genes show in the heatmap of those patient groups. To summarize:
1) First, what is your opinion of my approach with estimateSizeFactorsForMatrix() and normTransform()
from above, followed by a simple test such as t.test for the boxplots, given that the smallest group has 30 samples?
2) Concerning the use of DESeq2 for these genes:
my R pipeline for creating the VST-transformed counts and for the patient clustering/survival analysis was the following:
and afterwards clustering, etc.
Thus, in your opinion, if I were to run DESeq2 to obtain DE results and p-values for these genes, how should I formulate my code
so that it stays consistent with the approach above and uses the same transcripts?
Specifically, should I start from the object dat.filt, which has unique gene symbols as row names and has already been filtered,
and create a data frame with the various cluster memberships?
For example:
But then how should I run DESeq for DE in order to extract the pairwise adjusted p-values for these 20 genes, even when they are not significant?
Also, any other comments or suggestions would help!
Best,
Konstantinos
If you just want to plot and do simple testing for a few genes, I suppose there's no problem with your first approach, given that you have many samples. I wouldn't recommend performing several different analyses, though; stick with a single analysis plan from the start.
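The plotting and simple testing mentioned here could look roughly like the sketch below, written in base R. Everything in it is simulated and hypothetical: expr stands in for a log2-scale matrix of a few genes of interest, and cluster for the cluster membership of each sample; in practice both would come from the earlier transformation and clustering steps.

```r
# Simulated stand-ins (hypothetical): 3 genes x 90 samples on a log2-like scale,
# with 30 samples in each of 3 clusters
set.seed(42)
cluster <- factor(rep(c("C1", "C2", "C3"), each = 30))
expr <- matrix(rnorm(3 * 90, mean = 8), nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"), NULL))
# Built-in shift so that geneA differs in cluster C2
expr["geneA", cluster == "C2"] <- expr["geneA", cluster == "C2"] + 2

# One boxplot per gene, samples split by cluster
op <- par(mfrow = c(1, nrow(expr)))
for (g in rownames(expr)) boxplot(expr[g, ] ~ cluster, main = g, ylab = "log2 scale")
par(op)

# Two groups: Welch t-test; more than two groups: one-way ANOVA
two <- cluster %in% c("C1", "C2")
t.p <- apply(expr, 1, function(x) t.test(x[two] ~ droplevels(cluster[two]))$p.value)
aov.p <- apply(expr, 1, function(x) summary(aov(x ~ cluster))[[1]][["Pr(>F)"]][1])

# Adjust for the number of genes tested
t.padj <- p.adjust(t.p, method = "BH")
```

One caveat on interpretation: if the clusters were derived from these same genes, the groups differ on them by construction, so such p-values are descriptive rather than independent evidence.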
Dear Michael,
Thank you for your updated answer! To confirm that I have understood your comment correctly: by the first approach, you mean that I can use estimateSizeFactorsForMatrix() and normTransform()
on the whole dataset and then subset to the genes of interest, in order to create the boxplots with these "normalized" values and to perform a simple test for the p-values. Is that correct?
At the same time, the cluster/group membership of the patient samples used for the boxplots would still come from the initial VST-transformed counts. Is that also correct?
This is all up to you as the analyst. We have a detailed workflow and a vignette. But it's up to you exactly what steps you want to do in your analysis.
You can use any of our transformations that are described in the vignette. For all of these, you should run them over all genes, and then subset afterward.
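The "transform all genes, then subset" order can be sketched as follows. The dataset is simulated with DESeq2's makeExampleDESeqDataSet(), and genes.of.interest is a placeholder vector, so both are assumptions for illustration only.

```r
library(DESeq2)

# Simulated dataset standing in for the real one (hypothetical)
set.seed(7)
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)

# Transform using every gene, so that size factors and the dispersion trend
# are estimated from the full dataset
vsd <- vst(dds, blind = TRUE)

# Only now restrict to the genes of interest (placeholder IDs here)
genes.of.interest <- rownames(dds)[1:20]
vsd.sub <- assay(vsd)[genes.of.interest, ]
dim(vsd.sub)
```

Subsetting first would mean estimating the normalization and the variance-stabilizing fit from only a handful of genes, which is why the subset is taken last.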
Dear Michael,
Thank you for all your suggestions so far. Please excuse me for returning one last time, but I would like to ask a few quick but important questions about the DESeq2 pipeline and the specific functions I mentioned above, so that I use normTransform() correctly. In detail, my small code chunk:
So, my final questions:
1) Because I am constructing the DESeq object only for downstream analysis (clustering, boxplots, etc.), is it a problem that I used an intercept-only design, ~1?
No variable in the colData is of any interest, and I used this design only so that the function would run. Am I right that it has no effect, since my purpose is not DE analysis?
2) In the normTransform() function, would you suggest a pseudocount larger than 1, such as 4?
3) Finally, can the output of normTransform() also be used directly for hierarchical clustering and similar applications, like the VST-transformed counts we have discussed?
Thank you,
Konstantinos
~1 as a design is fine for normTransform.
You can try out different pseudocounts, or just use the VST, which is our solution to avoid fiddling with pseudocounts.
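As a rough illustration of trying out pseudocounts next to the VST, the sketch below builds an intercept-only dataset, applies normTransform() with two pseudocounts and vst(), and runs the same hierarchical clustering on each result. The data are simulated, and the helper cluster.samples and the choice of k = 2 are hypothetical.

```r
library(DESeq2)

# Simulated dataset (hypothetical) with an intercept-only design
set.seed(3)
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)
design(dds) <- ~ 1   # no DE testing is intended, only transformation

# log2(normalized count + pc) with two pseudocounts, plus the VST
ntd1 <- normTransform(dds, pc = 1)
ntd4 <- normTransform(dds, pc = 4)
vsd  <- vst(dds, blind = TRUE)

# Same clustering recipe applied to each transformed matrix
cluster.samples <- function(x) hclust(dist(t(assay(x))), method = "complete")
hc <- lapply(list(pc1 = ntd1, pc4 = ntd4, vst = vsd), cluster.samples)

# Compare the 2-cluster cut across the three transformations
memb <- sapply(hc, cutree, k = 2)
memb
```

If the cluster assignments change noticeably with the pseudocount, that is one sign of the fiddling the VST is meant to avoid.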