I have metagenomic data from soil samples, generated by a sequence capture method. That is, probes were designed based on the genes we wanted to capture from the microorganisms in the samples. The reads were assembled and the contigs were functionally annotated with KEGG, so I have a count table of contigs across the samples, a count table of KEGG Orthologies (KOs), and finally one for pathways.
I decided to explore the clustering of the data with PCA plots, but since my count data consist mostly of zeros, I looked for a transformation method and tried rlog and vst from DESeq2. These methods couldn't be applied to the contig matrix, since every contig had at least one zero in one of the samples, but this doesn't matter much because the PCA plots of KOs and especially Pathways seem to cluster the 2 soil sample groups somewhat nicely.
My problem, though, is that I find it hard to tell whether these data (contig counts grouped into KOs and Pathways) are appropriate for the rlog and vst transformations. Not being well versed in statistics, I thought I would be fine as long as my data followed a negative binomial distribution, but after searching the forums a bit more I found out that this is not the case.
Dear Mike,
It looks like this for the Pathways (20 are shown here, out of 91 in total):
That looks fine for the DESeq2 transformations. The transformations offered by DESeq2 simply put the data on the log2 scale, while dealing with the problem that the log of small counts has a lot of unwanted sampling variability. You can look through the DESeq2 vignette section on transformations and see which one looks best in the diagnostic plots:
https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
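To make the point about small counts concrete, here is a minimal sketch in plain Python (not DESeq2 code, and not how rlog or vst are implemented). It assumes purely Poisson sampling noise and a pseudocount of 1, both chosen only for illustration, and computes the standard deviation of log2(count + 1) at several mean counts. The function name `poisson_log2_sd` is hypothetical.

```python
import math

def poisson_log2_sd(mean, pseudocount=1.0):
    """Exact SD of log2(X + pseudocount) for X ~ Poisson(mean).

    Assumption: Poisson noise only; real count data are usually more variable.
    """
    # Sum far enough into the upper tail that the remaining mass is negligible.
    upper = int(mean + 12 * math.sqrt(mean) + 30)
    first = 0.0   # accumulates E[log2(X + pseudocount)]
    second = 0.0  # accumulates E[log2(X + pseudocount)^2]
    for k in range(upper + 1):
        # Poisson pmf evaluated in log space to avoid overflow for large means.
        p = math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
        y = math.log2(k + pseudocount)
        first += p * y
        second += p * y * y
    return math.sqrt(max(second - first * first, 0.0))

if __name__ == "__main__":
    for m in (2, 20, 200, 2000):
        print(f"mean count {m:>5}: SD of log2(count + 1) = {poisson_log2_sd(m):.3f}")
```

The spread on the log2 scale shrinks steadily as the mean count grows, which is why a plain log transform lets low-count rows contribute mostly noise to a PCA.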
Dear Mike,
I think I need to mention that samples G1-G5 and W1-W5 are not treated as biological replicates in my analysis, since the sample treatment varies from sample to sample within these two groups. However, I would guess that this doesn't change anything, since the default is blind = TRUE. I'll note here that the vst transformation seems to separate the two groups better than rlog does. Is it safe to assume that the more loosely clustered matrix for the KOs (323 KOs instead of the 91 Pathways in the previous matrix) is still suitable for these transformations?
Thank you!
I'll just say, the transformations are appropriate wherever a log transformation would be useful.
The only case in which I've seen the transformations not be useful is the combination of 0's AND very high counts within a row that I described above, when this is the distribution for the majority of genes.
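As a rough way to check for that pattern, the sketch below counts the fraction of rows that contain both a zero and a very high count. It is plain Python rather than DESeq2 code; the helper name `zero_and_high_fraction`, the toy matrix, and the cutoff of 1000 for "very high" are all assumptions made for illustration, since no threshold is given in this thread.

```python
def zero_and_high_fraction(counts, high=1000):
    """Fraction of rows with at least one zero AND at least one count >= high.

    `high` is an arbitrary illustrative cutoff, not a DESeq2 parameter.
    """
    if not counts:
        return 0.0
    flagged = sum(1 for row in counts if min(row) == 0 and max(row) >= high)
    return flagged / len(counts)

# Toy matrix: rows are features (e.g. KOs), columns are samples.
toy = [
    [0, 5200, 0, 4100],       # zeros and very high counts: flagged
    [12, 30, 25, 18],         # moderate counts, no zeros: not flagged
    [0, 3, 1, 0],             # zeros but only small counts: not flagged
    [900, 1500, 1100, 1300],  # high counts, no zeros: not flagged
]

if __name__ == "__main__":
    print(zero_and_high_fraction(toy))
```

If only a small share of rows is flagged, the situation described above (this pattern holding for the majority of rows) does not apply.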