I have RNA-seq count data from early embryonic development time series (normal, mutant, and treated samples) collected from multiple studies, and I plan to use it to cluster genes. To remove the study/batch effects I am using ComBat-seq with study ID as the batch. For normalization and transformation I then apply VST with the blind=TRUE option. After this, the variance of a gene is no longer correlated with its mean expression, which is good. However, in early embryonic development a lot of genes change substantially in their expression levels. Given these large changes, I am worried about using VST with blind=TRUE: I suspect that the gene dispersions are being overestimated.
As a simple check, I counted the genes that are down-regulated from the early to the late time point. With VST I get around 1600 genes with a log fold change <= -1. With log2(CPM+0.5) normalization instead, around 4000 genes pass the same threshold (log fold change <= -1). I understand that VST shrinks low-expressed genes more strongly in order to reduce noise. But I am not sure whether this large drop in the number of down-regulated genes is expected, or whether I am losing a lot of genes simply because they are highly variable across the embryonic development time course. Do you think this is okay? How should I determine whether blind=TRUE is an appropriate option? Or should I use VST with blind=FALSE?
The only information I have about these samples is study, developmental time point, and treatment. The issue is that I might have only one replicate sample for a given treatment, so I am not sure how to use these variables as covariates in the analysis. I would be happy to hear any suggestions or feedback.
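
For concreteness, the kind of log2(CPM+0.5) comparison described above can be sketched as follows. This is a minimal, illustrative sketch in plain Python, not the actual analysis code: the function names `log2_cpm` and `count_down` and the toy counts for four genes are assumptions made up for the example, and it uses a single early and a single late sample rather than averaged replicates.

```python
import math


def log2_cpm(counts, prior=0.5):
    """Return log2(CPM + prior) for one sample, given its raw gene counts."""
    library_size = sum(counts)
    return [math.log2(c / library_size * 1e6 + prior) for c in counts]


def count_down(early, late, threshold=-1.0):
    """Count genes whose late-minus-early log2 difference is <= threshold."""
    return sum(1 for e, l in zip(early, late) if (l - e) <= threshold)


# Toy raw counts for four hypothetical genes (assumed values, not real data).
early_counts = [400, 50, 1000, 8550]
late_counts = [100, 60, 1000, 8840]

early = log2_cpm(early_counts)
late = log2_cpm(late_counts)

# Per-gene log fold change (late vs. early) on the log2(CPM + 0.5) scale.
lfc = [l - e for e, l in zip(early, late)]

# Number of genes called down-regulated at log fold change <= -1.
n_down = count_down(early, late)
print(n_down)
```

The same `count_down` step could be applied to VST-transformed values for the same two time points, which is the comparison behind the roughly 1600 versus 4000 genes mentioned above.
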
Just to mention, my clustering results are generally better with VST normalization than with the other approaches I have tried. But I want to be sure that I am not doing something really wrong with VST and removing real biological variance from the data.