DESeq2, Still a most genes get padj=NA after declare cooksCutoff=FALSE in the results () function
1
0
Entering edit mode
colaneri ▴ 30
@colaneri-7770
Last seen 5.6 years ago
United States

Hi Im doing DGE analysis with DESeq2 to test difference in gene expression under 2 conditions. As other people reported in this foro I got a bunch of genes with NA. However in my case the NA value is in their majority for padj. (see table below) In other words I have a bunch of genes with very low pvalue but got a NA por padj.

I read in the vignette that, at least for p-values the NA represent genes with outliers count in one or more groups. I set the cooksCutoff=FALSE in the results() function but I still got the same results.

When I plot counts for one of these genes:

plotCounts(ddsIAAinCol0,"AT1G11090", intgroup = "treatment")

 

it does not seems to be an outlier 

 

Any suggestion?

 

 

  baseMean log2FoldChange lfcSE stat pvalue padj
AT5G47435 185.6364162 2.115139359 0.366914049 5.764672585 8.18E-09 1.37E-05
AT5G66055 185.3092117 1.644995185 0.293873843 5.597623681 2.17E-08 1.81E-05
AT1G11090 47.78777684 1.94133198 0.364829883 5.321197823 1.03E-07 NA
AT2G29100 15.17214648 1.666409849 0.450228259 3.701255565 0.000214535 NA
AT1G72600 13.04997183 -1.767013802 0.479422186 -3.685715542 0.000228061 NA
AT5G57290 75.34066235 -1.224003948 0.332385985 -3.682477613 0.000230978 NA
AT2G39460 33.69752389 -1.517438704 0.412598155 -3.677764148 0.000235287 NA
AT3G07050 62.78888747 -1.318916065 0.359422 -3.669547401 0.00024298 NA
AT1G10400 8.787243217 -1.793799153 0.489431764 -3.665064845 0.000247276 NA
AT1G48920 128.8451058 -1.121043683 0.306312462 -3.659804357 0.000252408 0.011708921
AT5G51690 56.90610637 1.385338028 0.380859561 3.637398583 0.000275406 NA
AT2G33060 82.5823932 -1.080766384 0.297955375 -3.627276014 0.000286427 0.012927924
AT4G16630 71.83981053 -1.231184086 0.341456276 -3.605685917 0.00031133 NA
AT5G55730 27.34342486 -1.483301796 0.412955145 -3.591919884 0.000328251 NA
AT5G65360 37.24544013 -1.324569311 0.369607195 -3.583721667 0.000338733 NA
AT1G09210 67.97424741 -1.169578499 0.327204788 -3.574454109 0.000350959 NA
AT2G18270 2119.114644 0.967748299 0.271131214 3.569298738 0.000357938 0.015522311
AT1G01160 153.078235 1.016068318 0.285036805 3.564691653 0.000364284 0.015522311
AT2G34480 103.1868193 -1.066463342 0.299624136 -3.559337231 0.000371792 0.015522311
AT4G14890 47.15691661 1.259644836 0.354034502 3.557971978 0.000373729 NA
AT4G30280 46.33922434 1.270915217 0.358566567 3.544433127 0.000393458 NA
AT1G06430 5.003252608 1.744025568 0.493321579 3.535271195 0.000407357 NA
AT2G18290 544.7909982 0.891251585 0.253231577 3.519512043 0.000432341 0.017610004
AT4G37910 42.46839645 -1.389330143 0.396446696 -3.504456353 0.000457541 NA
AT5G23420 22.05896603 -1.579334138 0.450931699 -3.502379943 0.000461122 NA
AT4G14320 54.99615495 -1.230414257 0.351421993 -3.501244323 0.000463091 NA
AT1G29310 10.29744763 -1.675254517 0.479076434 -3.496841832 0.000470801 NA
AT3G15960 26.15442415 -1.598265985 0.45751882 -3.493333858 0.00047703 NA
AT3G62600 51.08063801 -1.271109756 0.364746638 -3.48491151 0.0004923 NA
deseq2 NA cooksCutoff • 4.8k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 13 hours ago
United States
Please see the FAQ in the vignette.
ADD COMMENT
0
Entering edit mode

Dear Michael, according to the vignette the automatic independent filtering is setting the adjusted p values to NA because it considerer that those genes have a low mean normalized count. I found that only 1600 genes were considered to have a good number of counts. When I look at the data I found that 3000 genes with counts between 20  to 70 are considered low. It is that right? is a value of 20 a low mean normalized count?

I understand that if I remove the if I remove the independent filtering (independentFiltering=FALSE)) I have not more NA, but then I'm worry about where differences found in genes with mean normalized count between 20 to 70 can be trusted. Can you give me some insight?

I will appreciate your help

 

ADD REPLY
0
Entering edit mode

The threshold is chosen using the genefilter package. You can read more about the strategy in the citation for the genefilter package, or the section of the DESeq2 paper which discusses it.

Briefly: it is not that the genes with mean normalized count less than the data-driven threshold cannot be trusted.
"Trusted" implies that there would be a problem with specificity / false positives / Type I error if we included them. That is not the case. You can filter or not, and this won't affect the specificity (we have a plot showing the uniformity of p-values for various thresholds in the DESeq2 paper).

The reason for filtering low count genes in DESeq2 is to increase sensitivity for the dataset as a whole (i.e. reduces false negatives / Type II error). 

However, it's up to you in the end as the data analyst. You can use the filtering or you can set independentFiltering=FALSE and remove the genes with low counts, or keep all the genes with at least a single count per row, etc.

ADD REPLY

Login before adding your answer.

Traffic: 561 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6