Mark Lawson
Hello Bioconductor Gurus!
(I apologize if this goes through more than once)
We are currently using limma (through the voom() function) to analyze RNA-seq data, represented as RSEM counts. We have 246 samples (including replicates), and our design matrix has 65 columns.
My question is about how much we should filter our data before running it through the analysis pipeline. Our current approach is to keep transcripts with a CPM greater than 2 in at least half of the samples. The code is:
keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)
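To make the rule concrete, here is a minimal sketch of the same filter on a made-up 4 x 4 count matrix. It assumes plain base R: the CPM values are computed by hand from the column totals rather than with edgeR's cpm(), so no normalization factors are applied, and the names counts, cpm_mat, tx1..tx4 and s1..s4 are purely illustrative.

```r
# Toy counts: 4 transcripts (rows) x 4 samples (columns). All values are invented
# so that every library sums to exactly 1e6 and CPM equals the raw count.
counts <- matrix(
  c(500000, 500000, 500000, 500000,   # tx1: high in every sample
         3,      3,      0,      0,   # tx2: above 2 CPM in two samples
         3,      0,      0,      0,   # tx3: above 2 CPM in one sample only
    499994, 499997, 500000, 500000),  # tx4: filler to bring each library to 1e6
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("tx", 1:4), paste0("s", 1:4))
)

# Hand-rolled counts per million (a stand-in for cpm(dge), without norm factors)
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

# Same rule as in the post: CPM > 2 in at least half of the samples
keep <- rowSums(cpm_mat > 2) >= round(ncol(cpm_mat) / 2)
print(keep)
```

In this toy case tx3 is the only row dropped, because it passes the CPM cut-off in one sample and the rule requires two. With 246 samples, round(ncol(dge)/2) evaluates to 123, so a transcript must exceed 2 CPM in at least 123 samples to be kept.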
This brings our transcript count down from 73,761 to fewer than 20,000. While we do see the groupings and batch effects we expect in the MDS plots, we are afraid we might be filtering too severely.
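One rough way to gauge how severe a filter is would be to tabulate how many transcripts survive across a grid of CPM cut-offs and sample fractions. The sketch below does this on simulated counts only. It is not a recommended metric, the data are not ours, the helper n_kept() and the grid values are hypothetical, and CPM is again computed by hand in base R instead of with edgeR's cpm().

```r
set.seed(1)
n_tx <- 20000       # simulated transcripts (illustrative, not the real 73,761)
n_samples <- 20     # simulated samples (illustrative, not the real 246)

# Simulated negative binomial counts with widely varying mean abundance
mu <- rexp(n_tx, rate = 1 / 50)
counts <- matrix(rnbinom(n_tx * n_samples, mu = rep(mu, n_samples), size = 2),
                 nrow = n_tx)

# Hand-rolled counts per million (no normalization factors)
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

# Hypothetical helper: number of rows with CPM > min_cpm in at least
# round(min_frac * number of samples) samples
n_kept <- function(cpm_mat, min_cpm, min_frac) {
  sum(rowSums(cpm_mat > min_cpm) >= round(ncol(cpm_mat) * min_frac))
}

# Sweep both thresholds and record how many transcripts are retained
grid <- expand.grid(min_cpm = c(0.5, 1, 2, 5), min_frac = c(0.1, 0.25, 0.5))
grid$kept <- mapply(n_kept, min_cpm = grid$min_cpm, min_frac = grid$min_frac,
                    MoreArgs = list(cpm_mat = cpm_mat))
print(grid)
```

On real data the same table could be built from cpm(dge). A steep drop between neighbouring settings would suggest the retained set is sensitive to the exact thresholds chosen.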
So finally my question: What is a good metric for determining how well we have filtered the data?
Thank you,
Mark Lawson, PhD