DESeq2: relaxing gene filtering thresholds increases mean-SD correlation, but recovers many important genes

Ndimensional (@vlaufer-14169):

I'm writing to ask how to balance two conflicting goals.

On the one hand, in order to satisfy the homoscedasticity assumption behind the Gauss-Markov theorem, I want to eliminate the correlation between mean and standard deviation.

On the other hand, if I back off and relax the filters a bit, I recover genes that are well known to be involved in glioblastoma - in fact, some of the genes that show up in "flaggedGenes" (i.e., sd > 3) are actually used to stratify pediatric brain cancer. That makes sense - they're highly variable across patients, but not necessarily highly expressed overall.
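
For context, a minimal sketch of how genes get flagged this way, assuming a DESeqDataSet named dds (the update further down shows the exact call I ended up with):

library(DESeq2)
library(vsn)

vsd <- vst(dds)                    # variance-stabilizing transformation
mm  <- meanSdPlot(assay(vsd))      # plots rank(mean) vs per-gene sd
flaggedGenes <- which(mm$py > 3)   # genes whose sd exceeds 3 on the VST scale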

The filters that are posted here and there - in the vignette, in various help topics here and on Biostars - all provide numbers that are typical of "garden variety" bulk RNA-seq studies. For example, in a post Michael Love wrote, he provides the following block of code:

dds <- estimateSizeFactors(dds)        # estimate per-sample size factors
nc <- counts(dds, normalized=TRUE)     # normalized count matrix
filter <- rowSums(nc >= 10) >= 2       # >= 10 normalized counts in >= 2 samples
dds <- dds[filter,]                    # keep only genes passing the filter

So, genes that have a normalized count of 10 or greater in at least 2 samples are retained. In the context of a study with n=6, that makes sense, but I have >1000 bulk RNA samples in this analysis, so that filter would be excessively lax; something like filter <- rowSums(nc >= 10) >= 50 would be more proportionate. The thing is, if 49 of those samples all fall in the treatment or naive group, I'd potentially be discarding a gene that I have near-perfect statistical power to detect.
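
One way to guard against that - a sketch only, assuming Treatment is a two-level factor in colData, with purely illustrative thresholds - is to apply the filter within each group and keep genes that pass in at least one:

nc  <- counts(dds, normalized=TRUE)
grp <- colData(dds)$Treatment                   # assumed to be a factor
minSamp <- floor(min(table(grp)) / 2)           # half the smallest group (illustrative)
keep <- sapply(levels(grp), function(g)
  rowSums(nc[, grp == g, drop=FALSE] >= 10) >= minSamp)
dds <- dds[rowSums(keep) > 0, ]                 # passes in at least one group

This way a gene expressed only in treated samples is not penalized for being silent in the naive group.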

In fact, the list of flagged genes when I retain as many as 35,000 transcripts consists MOSTLY of glioblastoma oncogenes, which is stunning. I think it's possible these are showing up this way because they are definitive of various GBM endophenotypes, so I'm reluctant to discard them until I have proven or disproven that exhaustively.

As such, I think part of the "magic" here, so to speak, might lie in devising custom filters that provide more realistic values for a cohort of this size. Does anyone have advice on how I might do this? rowVars() comes to mind, but I'll entertain any idea with sound theoretical or empirical support. One idea ATpoint already gave me was to use filterByExpr() from edgeR; I am trying this presently, but my lack of experience with the function makes me unsure what to try for its arguments - the annotated call below is where I am starting.
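
For concreteness, this is the sort of call I'm experimenting with - the defaults shown are edgeR's own, and the grouping column is just what my colData happens to contain:

library(edgeR)
keep <- filterByExpr(counts(dds),
                     group     = colData(dds)$Treatment,  # grouping factor
                     min.count = 10,   # minimum count, internally converted to a CPM cutoff
                     min.prop  = 0.7)  # fraction of the smallest group that must pass
dds <- dds[keep, ]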

Thanks for your time and help!

Tags: metaanalysis meanSd genefilter DESeq2

Update: So far, this is performing fairly well:

filter <- filterByExpr(counts(procDds),
                       group     = colData(procDds)$Treatment,
                       min.count = 10,
                       min.prop  = 0.2)   # >= 20% of the smallest group
edgyDds <- procDds[filter, ]

veggie <- vst(edgyDds)                    # variance-stabilizing transformation
mm <- meanSdPlot(assay(veggie))           # rank(mean) vs per-gene sd
flaggedGenes <- which(mm$py > 3)          # flag genes with sd > 3
flaggedGenes
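
For the record on those choices: min.prop = 0.2 requires expression in at least 20% of the smallest Treatment group, which scales the usual "at least 2 samples" heuristic up to a cohort of this size, and mm$py > 3 flags genes whose standard deviation on the VST scale exceeds 3.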

Michael Love (@mikelove) replied:

Your comment above looks reasonable.

Given that you may very well have bimodal expression within condition, I think careful examination of the mean and variance is a good approach. High variance within the GBM samples may indicate important biological variability.
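
A quick way to eyeball that - sketched against the objects from your update, assuming the indices in mm$py line up with the rows of assay(veggie):

g <- flaggedGenes[1]                       # row index of one flagged gene
boxplot(assay(veggie)[g, ] ~ colData(veggie)$Treatment,
        ylab = "VST expression",
        main = rownames(veggie)[g])
# a bimodal spread within one condition points to biological subgroups
# (endophenotypes) rather than technical noise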

Thanks Mike - going to go with this. On to pathways :-)
