Hello all,
I'm a graduate student working on analyzing RNA-Seq data from a microbiome - many different species all found in the same environment. I used some annotation software and Python to get a list of counts for hits to each organism, within each sample. Imported these counts into R, created a combined data table, and then used DESeq2 for comparing the 12 wild-type (WT) samples with my 12 samples from diseased individuals.
When comparing between DESeq's normalized counts, AKA counts(dds, normalized=TRUE), versus the raw counts per million transcripts (CPM) for these organisms, I noticed something interesting. For some of the higher-abundance organisms, the CPM is higher in my WT samples, if only slightly. However, in DESeq2's normalized data, counts are higher in my diseased samples.
In other words, for some of the organism counts, they are being shifted down in my WT samples and up in my experimental (diseased) samples, to the point where the final calculated log2FoldChange shows these organisms as more active in my experimental samples - even though the CPM is higher in my WT samples.
I know that DESeq fits to a negative binomial distribution, but is that strong enough to lead to a reversed direction of the log2FoldChange for these entries?
Code I'm running:
> completeCondition <- data.frame(condition=factor(c(rep("control", length(control_files)), rep("experimental", length(exp_files)))))
> colnames(complete_table) <- completeCondition
> dds <- DESeqDataSetFromMatrix(complete_table, completeCondition, ~ condition)
> dds <- DESeq(dds)
> normalized_data <- counts(dds, normalized=TRUE)
Any help is greatly appreciated!
Hello Wolfgang,
Thank you for the suggestion! I haven't yet tried looking at a scatterplot with the species in question selected, but I'll attempt that to see how much it shifts in position.
Yes, I am aware that DESeq2 works off of actual counts. I am comparing the FPM matrix results with the normalized data (counts(dds)). I am not using the CPM values as input.
This is RNA-seq data, yes, but I'm pooling all gene hits to the same organism to look at changes in organism abundance.