I'm writing to ask how to balance two conflicting goals.
On the one hand, in order to satisfy the conditions for the Gauss-Markov theorem, I want to eliminate correlation between mean and sd.
On the other hand, if I back off and relax the filters a bit, I see genes that are well known to be involved in glioblastoma - in fact some of the genes that show up in "flaggedGenes" (i.e. sd > 3) are actually used to stratify pediatric brain cancer. Makes sense - their highly variable across the people, but not necessarily super highly expressed overall.
The filters that are posted here and there - in the vignette, in various help topics here on biostars, all provide numbers that are typical of "garden variety" bulk RNA-seq studies. For example, a post that Michael Love wrote, here, he provides the following block of code:
dds <- estimateSizeFactors(dds)
nc <- counts(dds, normalized=TRUE)
filter <- rowSums(nc >= 10) >= 2
dds <- dds[filter,]
so, genes that have a normalized count of 10 or greater in at least 2 samples are retained. In the context of study with n=6, that makes sense, but I have >1000 bulk RNA samples in this analysis, so that filter would be excessively lax. e.g. filter <- rowSums(nc >= 10) >= 50
. The thing is, if 49 of those patients stack in the treatment or naive group, Id potentially be discarding a gene that I have near perfect statistical power to detect.
In fact, the list of flagged genes when I retain as many as 35,000 transcripts contains MOSTLY glioblastoma oncogenes, which is stunning. I think its possible these might be showing up in this way because they are definitive of various GBM endophenotypes.. so reluctant to discard them until I have proven or disproven that exhaustively.
As such, I think part of the "magic" here, so to speak, might be on devising some custom filters that provide more realistic values for a cohort of this size. Does anyone have advice on how I might do this? rowVars()
comes to mind, but I'll entertain any idea with sound theoretical or empiric support. One idea that ATpoint gave me already was to use filterByExpr() from edgeR, I am trying this presently, but here the issue is lack of experience with having used to makes me unsure what to try for arguments to the function.
Thanks for your time and help!
Update: So far, this is performing fairly well: