I would be grateful if anybody could offer suggestions regarding controlling for outliers and filtering.
My input data is a large gene count table containing 598,019 bacterial genes in the rows, with three control samples and three disease samples as the columns. The data can be described as zero-inflated negative binomial. Here are a few lines of my data; cols 1:3 are control, cols 4:6 are disease.
gene
NODE1length36040cov12.43651    37   0   2   0   0   0
NODE1length36040cov12.43653     1   0   0   0   0   0
NODE2length32139cov10.31191   187   0   3   0   0   0
NODE2length32139cov10.31192    97   0   6   0   0   0
NODE2length32139cov10.31193   162   0   0   0   0   0
NODE3length27992cov10.59761    16   0   0   0   0   0
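(A rough way to quantify the zero inflation, assuming the table is loaded as gctab as in the code below; this is just an illustrative sketch:)

mat <- as.matrix(gctab)          # gene count matrix, genes x samples
mean(mat == 0)                   # overall proportion of zero counts
mean(rowMeans(mat == 0) > 0.5)   # proportion of genes that are zero in most samples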
Here is the code I am using:

library(DESeq2)

gctab <- read.csv("final.gene.count.table1.nonzero.csv", row.names = 1)
DF <- data.frame(id = colnames(gctab), type = rep(c("ctrl", "disease"), each = 3))
dds <- DESeqDataSetFromMatrix(gctab, DF, ~ type)
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"))
summary(res)
1) With the Cook's cutoff applied (the default), these are my results:

out of 598019 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 965, 0.16%
LFC < 0 (down)   : 111, 0.019%
outliers [1]     : 12602, 2.1%
low counts [2]   : 0, 0%
A lot of the pvalue and padj entries are NA, as expected.
2) Turning the Cook's cutoff off gives these results:

out of 598019 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 3885, 0.65%
LFC < 0 (down)   : 1918, 0.32%
outliers [1]     : 0, 0%
low counts [2]   : 568118, 95% (mean count < 11)

A lot of the padj values are now NA.
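(A minimal sketch of how that is done, assuming the same dds object as above; cooksCutoff is the documented results() argument in DESeq2:)

res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"),
               cooksCutoff = FALSE)  # do not flag outliers via Cook's distance
summary(res)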
I would prefer not to turn independent filtering off. Would the only alternative be to change alpha (which I wasn't too keen on doing either)? Could somebody please suggest ways I can reduce the number of NA values in my padj output?
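If it helps anyone answering, I believe the threshold that independent filtering chose can be inspected like this (accessors taken from the DESeq2 vignette):

# mean-count threshold below which padj was set to NA by independent filtering
metadata(res)$filterThreshold
# number of genes whose mean of normalized counts falls below that threshold
sum(res$baseMean < metadata(res)$filterThreshold)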
Thank you,
Hi Michael,
Thank you for your reply; I understand you're very busy and do appreciate it. I can't seem to get any help from my tutors at university at the moment. As I need to compare differentially expressed genes between columns 1:3 (control) and 4:6 (disease), would the code you have provided ensure that comparisons are only made for disease relative to control when the count row sums for disease are above 5 and the count row sums for control are above 5? If either one is not, I am assuming the gene in question will be removed from the analysis. To double-check: would the keep code need to go after dds <- DESeq(dds), with the Cook's cutoff left at the default (not set to FALSE)?
DF <- data.frame(id = colnames(gctab), type = rep(c("ctrl", "disease"), each = 3))
dds <- DESeqDataSetFromMatrix(gctab, DF, ~ type)
dds <- DESeq(dds)

# not sure whether >= 3 would be sufficient when needing to compare disease relative to control
keep <- rowSums(counts(dds) >= 5) >= 3
dds <- dds[keep, ]

res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"))
summary(res)
Thank you
I'm suggesting putting this line of code above DESeq().
It will subset your dataset to the genes described in words in my post above.
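(In full, a minimal sketch of that order, reusing the object names from earlier in the thread; the >= 5 in >= 3 samples cutoff is the one discussed above, not a universal rule:)

dds <- DESeqDataSetFromMatrix(gctab, DF, ~ type)
# keep genes with a count of at least 5 in at least 3 of the 6 samples
keep <- rowSums(counts(dds) >= 5) >= 3
dds <- dds[keep, ]
dds <- DESeq(dds)  # dispersion estimation and testing now use only the retained genes
res <- results(dds, alpha = 0.05, contrast = c("type", "disease", "ctrl"))
summary(res)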
In case anybody else reads this who is also trying to analyse a very zero-inflated data set and observes a lot of NA values for pvalue and padj: I used keep <- rowSums(counts(dds) >= 5) >= 3 and still obtained NA values for a lot of the pvalue and padj entries.
It knocked the 598,019 genes down to 17,882; 1,575 were upregulated, 4,172 were downregulated, and 1,858 were outliers.
na.omit() will knock more out now too.
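(For anyone wanting to do the same, a sketch of dropping the NA rows, converting to a plain data.frame first since subsetting is straightforward there:)

resDF <- as.data.frame(res)
resClean <- na.omit(resDF)  # drops rows with NA in any column, including pvalue/padj
nrow(resClean)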