We have RNA-Seq data from different body parts: oral and aboral parts of small, medium and large specimens. We mapped the reads with TopHat2 and ran both Cuffdiff and DESeq2 (with counts from htseq-count). With the csDendro function in cummeRbund the samples cluster largely by oral/aboral part, but when we cluster a distance matrix of rlog values from DESeq2 the samples group mainly by size.
DESeq2 commands:

    # Heatmap of sample-to-sample distances
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13), main = "Sample-to-sample distances (rlog)")
I guess the main difference between the two approaches is that FPKM values are the basis of csDendro, while rlog-transformed raw counts are used for the DESeq2 approach? However, I could not find out whether all genes are included by the csDendro function. For the DESeq2 approach we excluded genes with zero counts in all samples; otherwise all genes should be included.
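The zero-count filter mentioned above can be sketched in base R as follows. The matrix `counts` and its gene and sample names are made up for illustration; in a real DESeq2 workflow the same kind of row filter would be applied to the actual count data before the transformation.

```r
# Toy count matrix (hypothetical values): rows are genes, columns are samples.
counts <- matrix(
  c(10, 0, 5,
     0, 0, 0,
     3, 7, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3"))
)

# Keep only genes with at least one read in at least one sample.
keep <- rowSums(counts) > 0
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

Note that this only removes genes that are zero everywhere; genes with a few reads in a single sample are still kept.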
When we create PCA plots using both packages, they are roughly similar. Has anyone experienced similar results? What could explain these differences?
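One factor that may contribute, offered as a hedged sketch rather than a diagnosis: the two dendrograms differ in the distance measure as well as in the expression values. As far as I know, csDendro clusters on Jensen-Shannon distances between FPKM profiles, whereas dist() in the commands above gives Euclidean distances between rlog values. The toy data and the helper `js_dist` below are hypothetical and written in base R, log2(x + 1) is only a stand-in for rlog, and this is not the cummeRbund implementation.

```r
# Hypothetical expression values for four genes in three samples.
expr <- cbind(
  A = c(10, 20, 30, 40),
  B = c(100, 200, 300, 400),  # same proportions as A, 10x the scale
  C = c(14, 16, 34, 36)       # similar scale to A, different proportions
)

# Jensen-Shannon distance between two profiles after scaling each to sum to 1.
js_dist <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  sqrt((kl(p, m) + kl(q, m)) / 2)
}

samples <- colnames(expr)
js <- outer(samples, samples,
            Vectorize(function(i, j) js_dist(expr[, i], expr[, j])))
dimnames(js) <- list(samples, samples)

# Euclidean distance on log-scale values, as dist() computes it.
eu <- as.matrix(dist(t(log2(expr + 1))))

print(round(js, 3))
print(round(eu, 3))
```

In this toy case the Jensen-Shannon distance puts A closest to B, because only the proportions matter, while the Euclidean distance on log values puts A closest to C. Whether something similar is happening with the real data would need to be checked on the real data.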
Yes. I find PCA plots more useful than dendrograms. A PCA plot gives another dimension for separating samples, and there is also the problem Wolfgang mentions about the horizontal ordering of leaves in the dendrogram.
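The point about leaf ordering can be illustrated with a small base R sketch. The simulated matrix `x` and the sample names are hypothetical. Reversing the leaves of a dendrogram changes the left-to-right order in the plot but not the tree itself, whereas PCA gives each sample a position on continuous axes.

```r
set.seed(1)
# Hypothetical log-scale expression: 50 genes (rows) by 6 samples (columns).
x <- matrix(rnorm(300), nrow = 50,
            dimnames = list(NULL, paste0("s", 1:6)))

# Hierarchical clustering of samples on Euclidean distances.
hc <- hclust(dist(t(x)))
dend <- as.dendrogram(hc)
flipped <- rev(dend)  # same tree, leaves drawn in reverse order

print(labels(dend))
print(labels(flipped))

# The tree-implied (cophenetic) distances are the same for both orderings.
coph <- as.matrix(cophenetic(dend))
coph_flipped <- as.matrix(cophenetic(flipped))
print(all.equal(coph, coph_flipped[rownames(coph), colnames(coph)]))

# PCA of the same samples: continuous coordinates per sample.
pca <- prcomp(t(x))
print(round(pca$x[, 1:2], 2))
```

So two samples that end up next to each other along the bottom of a dendrogram are not necessarily similar; only the height at which their branches join carries that information.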