I have an RNA-seq data set with two replicates per condition. The library sizes range from 16970950 to 36720407 (shown below).
> DGEList$samples
   group lib.size norm.factors
A1     A 31271688            1
A2     A 36720407            1
B1     B 16970950            1
B2     B 23655334            1
For the differential expression analysis in edgeR, I tried two different CPM-based filtering criteria. All other parameters were the same.
With "keep <- rowSums(cpm(y)>0.01) >= 2", I got about 6000 DE genes: 3000 up-regulated and 3000 down-regulated.
With "keep <- rowSums(cpm(y)>1) >= 2", I got about 7000 DE genes: 2000 up-regulated and 5000 down-regulated.
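To see what these two cutoffs mean in raw read counts, each CPM value can be converted back using the library sizes listed above. This is a minimal base-R sketch; the variable and function names are only illustrative, and it assumes norm.factors of 1, as in the sample table.

```r
# Library sizes copied from the DGEList$samples table above
lib.size <- c(A1 = 31271688, A2 = 36720407, B1 = 16970950, B2 = 23655334)

# Raw count that corresponds to a given CPM cutoff in each library:
# count = CPM * (library size / 1e6)
cpm_to_count <- function(cpm_cutoff, lib.size) {
  cpm_cutoff * lib.size / 1e6
}

counts_at_0.01 <- cpm_to_count(0.01, lib.size)
counts_at_1    <- cpm_to_count(1, lib.size)

print(round(counts_at_0.01, 2))
print(round(counts_at_1, 2))
```

With these library sizes, a CPM of 0.01 is less than one read in every library, so the first filter only requires a non-zero count in two libraries. A CPM of 1 corresponds to roughly 17 to 37 reads, depending on the library.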
With both criteria, all of the marker genes that we know to be differentially expressed are called as DE, so judged by the marker genes alone, both criteria look fine.
Why do the filtering criteria have so much influence on the number of differentially expressed genes?
What is a good counts-per-million (CPM) cutoff for filtering RNA-seq count data in edgeR?
What factors should be taken into consideration when choosing the filtering criteria?
If you search this help forum, you'll find lots of advice on how to filter RNA-seq data. There's no one threshold that's guaranteed to work for all data sets.
The edgeR workflow uses "keep <- rowSums(cpm(y)>0.5) >= 2". It says: "As a rule of thumb, we require that a gene have a count of at least 10–15 in at least some libraries before it is considered to be expressed in the study." The library sizes there are a little smaller than mine. Does that mean I can use a cutoff smaller than 0.5? And does a higher cutoff give better results?
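The rule of thumb quoted above is stated in counts, so it can be turned into a CPM cutoff for a given library size. The following base-R sketch does that for the four libraries in this thread; the helper name is hypothetical, and which library to base the cutoff on (smallest, largest, or something in between) is a judgment call, not something the quoted rule settles.

```r
# Library sizes copied from the DGEList$samples table in the question
lib.size <- c(A1 = 31271688, A2 = 36720407, B1 = 16970950, B2 = 23655334)

# CPM that corresponds to a given raw count in each library:
# CPM = count / (library size / 1e6)
count_to_cpm <- function(min_count, lib.size) {
  min_count / (lib.size / 1e6)
}

cpm_for_10 <- count_to_cpm(10, lib.size)
cpm_for_15 <- count_to_cpm(15, lib.size)

print(round(cpm_for_10, 2))
print(round(cpm_for_15, 2))
```

For a count of 10, this gives about 0.59 CPM in the smallest library (B1) and about 0.27 CPM in the largest (A2), so the cutoff implied by the rule depends on which library it is applied to.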
I don't understand why the filtering criterion can have so much influence on the DE results.