I would be grateful if anybody could offer suggestions regarding controlling for outliers and filtering.
My input data is a large gene count table containing 598,019 bacterial genes in the rows, with three control samples and three disease samples as the columns. The data can be described as zero-inflated negative binomial. Here are a few lines of my data; cols 1:3 are control, cols 4:6 are disease.
gene
NODE1length36040cov12.43651    37   0   2   0   0   0
NODE1length36040cov12.43653     1   0   0   0   0   0
NODE2length32139cov10.31191   187   0   3   0   0   0
NODE2length32139cov10.31192    97   0   6   0   0   0
NODE2length32139cov10.31193   162   0   0   0   0   0
NODE3length27992cov10.59761    16   0   0   0   0   0
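(A rough way to quantify the zero inflation, assuming the table is loaded as gctab as in the code below; this is just an illustrative sketch:)

mat <- as.matrix(gctab)          # gene count matrix, genes x samples
mean(mat == 0)                   # overall proportion of zero counts
mean(rowMeans(mat == 0) > 0.5)   # proportion of genes that are zero in most samples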
Here is the code I am using:

library(DESeq2)

gctab <- read.csv("final.gene.count.table1.nonzero.csv", row.names = 1)
DF <- data.frame(id = colnames(gctab), type = rep(c("ctrl", "disease"), each = 3))
dds <- DESeqDataSetFromMatrix(gctab, DF, ~ type)
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"))
summary(res)
1) With the Cook's cutoff applied (the default), these are my results:

out of 598019 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 965, 0.16%
LFC < 0 (down)   : 111, 0.019%
outliers [1]     : 12602, 2.1%
low counts [2]   : 0, 0%
A lot of the pvalue and padj entries are NA, as expected.
2) Turning the Cook's cutoff off gives these results:

out of 598019 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 3885, 0.65%
LFC < 0 (down)   : 1918, 0.32%
outliers [1]     : 0, 0%
low counts [2]   : 568118, 95% (mean count < 11)

A lot of the padj values are now NA.
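(A minimal sketch of how that is done, assuming the same dds object as above; cooksCutoff is the documented results() argument in DESeq2:)

res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"),
               cooksCutoff = FALSE)  # do not flag outliers via Cook's distance
summary(res)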
I would prefer not to turn independent filtering off. Would the only alternative be to change alpha (which I wasn't too keen on doing either)? Could somebody please suggest ways I can reduce the number of NA values in my padj output?
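If it helps anyone answering, I believe the threshold that independent filtering chose can be inspected like this (accessors taken from the DESeq2 vignette):

# mean-count threshold below which padj was set to NA by independent filtering
metadata(res)$filterThreshold
# number of genes whose mean of normalized counts falls below that threshold
sum(res$baseMean < metadata(res)$filterThreshold)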
Thank you,
Hi Michael,
Thank you for your reply; I understand you're very busy and do appreciate it. I can't seem to get any help from my tutors at university at the moment. As I need to compare differentially expressed genes between columns 1:3 (control) and 4:6 (disease), would the code you have provided ensure that comparisons are only made for disease relative to control when the count row sums for disease are above 5 and the count row sums for control are above 5? If either one is not, I am assuming the gene in question will be removed from the analysis. To double-check: would the keep code need to go after dds <- DESeq(dds), with the Cook's cutoff left at the default (not set to FALSE)?
DF <- data.frame(id = colnames(gctab), type = rep(c("ctrl", "disease"), each = 3))
dds <- DESeqDataSetFromMatrix(gctab, DF, ~ type)
dds <- DESeq(dds)

# not sure whether >= 3 would be sufficient when needing to compare disease relative to control
keep <- rowSums(counts(dds) >= 5) >= 3
dds <- dds[keep, ]

res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"))
summary(res)
Thank you
I'm suggesting putting this line of code above DESeq().
It will subset your dataset to the genes described in words in my post above.
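(In full, a minimal sketch of that order, reusing the object names from earlier in the thread; the >= 5 in >= 3 samples cutoff is the one discussed above, not a universal rule:)

dds <- DESeqDataSetFromMatrix(gctab, DF, ~ type)
# keep genes with a count of at least 5 in at least 3 of the 6 samples
keep <- rowSums(counts(dds) >= 5) >= 3
dds <- dds[keep, ]
dds <- DESeq(dds)  # dispersion estimation and testing now use only the retained genes
res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"))
summary(res)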
In case anybody else reads this who is also trying to analyse a very zero-inflated data set and observes a lot of NA values for pvalue and padj: I used keep <- rowSums(counts(dds) >= 5) >= 3 and still obtained NA values for a lot of the pvalue and padj entries.
It knocked the 598,019 genes down to 17,882; 1,575 were upregulated, 4,172 were downregulated, and 1,858 were outliers.
na.omit() will knock more out now too.
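(For anyone wanting to do the same, a sketch of dropping the NA rows, converting to a plain data.frame first since subsetting is straightforward there:)

resDF <- as.data.frame(res)
resClean <- na.omit(resDF)  # drops rows with NA in any column, including pvalue/padj
nrow(resClean)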