Filtering for genes >100 counts after LRT still showing significant genes with genes with 0 counts!
1
0
Entering edit mode
A ▴ 40
@a-14337
Last seen 13 months ago
United Kingdom

Hi all, 

Just wondering if somebody can chime in on a funny problem I am having. I am filtering genes that have a minimum count of 100 to show after an LRT test to look for changes over time. This succesfully removes low count genes in the summary(results), however, when I make a count matrix of significantly expressed genes based on dds that has the filter applied (100 counts), my count matrix contains genes that have counts of 0 and above.. not a minimum of 100 counts! Please see code attached. Any input would be helpful! Many thanks.

LRT:

ddsWT <- DESeq(ddsWT, test = "LRT", reduced = ~1)
resultsNames(ddsWT)
results(ddsWT)
ddsWT<-ddsWT[rowSums(counts(ddsWT))>100]
nrow(ddsWT).... 20337 (filter successfully applied)
summary(results(ddsWT))... Low counts removed

Count matrix:

res_LRTWT<-results(ddsWT)
padj.cutoff<-0.05
sig_res_LRTWT <- subset(res_LRTWT, padj < padj.cutoff)
sigLRTWT <- rownames(sig_res_LRTWT)
length(sigLRT_genesdeannaWT).... Lower gene number than not applying filter of minimum genes!

ordered_sig_res_LRTWT <- sigLRTWT[order(sigLRTWT$padj), ]
clustering_sig_genesWT <- data.frame(ordered_sig_res_LRTWT[1:14938,])
rld_WT<-assay(ddsWT) 
cluster_rlogWT <- rld_WT[rownames(clustering_sig_genesWT),].... dim(cluster_rlogWT).. 14938 genes
 

Output (part of)

ENSMUSG00000029814      6     14      7    19      0      0      2      0      2
ENSMUSG00000091955    636    671    648   728    157     61     93    163    103
ENSMUSG00000026556   1052    923   1099  1134    449    293    469    501    443
ENSMUSG00000074968    274    171    256   243    733    565   1022    931    919
ENSMUSG00000039735   3590   3662   2995  3740    673    289    490    553    363
ENSMUSG00000022096   3359   1385   2704  1853   4420   2462   4135   3988   2898
ENSMUSG00000022610    231    126    216   168   1516    819   1407   1109    846
ENSMUSG00000019194   2684   3473   2219  2532   8965   6534   8271   9671   6609
ENSMUSG00000025950   1601   1609   1933  1659    517    295    473    520    457
ENSMUSG00000024426   1817   1453   1618  1446    850    552    746    869    618
ENSMUSG00000020889   1223   1095   1317  1229   1896   1231   2348   2622   2006
ENSMUSG00000031934    430    474    495   566     72     31     66     63     62
ENSMUSG00000031073      0      4      1     2      0      1      0      0      0
ENSMUSG00000028832   5384   3770   5434  5127   2812   1145   1759   2351   2152

There are many lines like the ones highlighted in bold, but I have filtered for genes of min 100 counts... Any input would be much appreciated. Many thanks!

 

 

LRT CPM lowcountgenes deseq2 • 1.7k views
ADD COMMENT
0
Entering edit mode

Thank you very much!! And yes, I also carried out the following code before running DESeq, where I also add conditions for how many samples should contain this minimum CPM count

filter <- rowSums(nc >= 100) >= 2
ddsWT<-ddsWT[filter,]
nrow(ddsWT)

Thanks again!

ADD REPLY
0
Entering edit mode

It's up to you, but I typically use a much lower filter if I do any filtering, e.g. normalized count of 10.

ADD REPLY
0
Entering edit mode

yep! I apologise, actually that was arbitrary... And I am looking at 1-10 as the filter. The reason I went up to 100 is because when I was filtering any count above 0 (i.e. 10), I was seeing so many 0's in the count matrix after the LRT, so increased it to 100 just to see what happens when I have a higher filter.

 

One further question actually (I did ask this in another post but just wanted to clarify).. If i want to see how genes change in time only, I have  question about which design to use as I am unsure. To look for differences in time only.. Is there a difference to having a full design of design =~Genotype + Time +Genotype:Time

or a full design of ~Genotype+Time

and then reduced of ~Genotype

Does including the interactor of Genotype+Time actually make a difference if i only want to look at differences in time only? If so how and why are the two full designs different apart from the interactor

ADD REPLY
1
Entering edit mode

The easiest way to find genes that change in time only would be to specify full=~genotype + time and reduced=~genotype. You can perform test="LRT" to obtain a gene list.

ADD REPLY
0
Entering edit mode

Many thanks!! Just out of interest though, just so I understand the full design itself, would I get the same list of DE genes if my full design is ~Genotype + Time +Genotype:Time and reduced genotype

as full design of ~Genotype + Time and reduced genotype?

 

ADD REPLY
2
Entering edit mode

No, the first set of designs you have allows for genotypic-specific time changes, whereas the second set of designs has a single time profile with only shifts due to genotype at baseline. The list will definitely be different, although some overlap.

You could get a gene in the significant list for the first set of designs which only changes over time for one genotype. Whereas for the second set of designs, the genes will be more often showing changes over time for all genotypes.

This kind of an analysis choice is up to you.

ADD REPLY
0
Entering edit mode

Amazing! Thank you so much for making that crystal clear! 

 

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 9 hours ago
United States

This step above needs a comma so you filter by rows.

ddsWT <- ddsWT[ rowSums(counts(ddsWT))>100 ]

you should use code that looks like:

dds <- dds[ keep, ]

Also, to save time you can filter before running DESeq().

ADD COMMENT

Login before adding your answer.

Traffic: 600 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6