How should I understand cpm=1, and how does filterByExpr work?
I have read the documentation, but I cannot understand it easily.
Hi again,
The documentation for filterByExpr() is clear. Please take a look: https://rdrr.io/bioc/edgeR/man/filterByExpr.html
If you have any doubt or misunderstanding, then please elaborate on that specifically (not generally).
Kevin
I found the following two articles very helpful in explaining the functions "cpm" and "filterByExpr":
Article #1: From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline [version 2; peer review: 5 approved] https://f1000researchdata.s3.amazonaws.com/manuscripts/9996/37711c4a-d061-4af3-8ca0-5e8714e43c8e_8987_-_gordon_smyth_v2.pdf?doi=10.12688/f1000research.8987.2&numberOfBrowsableCollections=50&numberOfBrowsableInstitutionalCollections=4&numberOfBrowsableGateways=40
Article #2: RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 3; peer review: 3 approved] https://f1000researchdata.s3.amazonaws.com/manuscripts/18347/846f3e9f-1806-4c61-8fcf-254e56340236_9005_-__matt_ritchie_v3.pdf?doi=10.12688/f1000research.9005.3&numberOfBrowsableCollections=50&numberOfBrowsableInstitutionalCollections=4&numberOfBrowsableGateways=40
P.S.: I strongly agree: the documentation that accompanies functions is very often vague and hard to understand.
The filterByExpr function keeps rows that have worthwhile counts in a minimum number of samples (two samples in this case because the smallest group size is two). The function accesses the group factor contained in y in order to compute the minimum group size, but the filtering is performed independently of which sample belongs to which group so that no bias is introduced. It is recommended to recalculate the library sizes of the DGEList object after the filtering, although the downstream analysis is robust to whether this is done or not.
This sentence in the edgeR user guide is hard to understand, especially the bold part; it does not say why we need to do so.
Users should also filter with count-per-million (CPM) rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.
Also, CPM depends on library size. If we sequence 100M reads and cpm=1 for the EGFR gene in sample A, what does cpm=1 stand for in reality? The code in the user guide does not calculate CPM; I did not see this part.
and you can see
After the filtering, the library sizes would slightly change as some genes are filtered out. It is recommended to recalculate the library sizes to get more precise values, although it won't make any noticeable difference in general.
In reality, a CPM of 1 stands for 1 count in every 1 million reads. In your case, with 100M reads, it would be equivalent to a read count of 100.
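To make the arithmetic above concrete, here is a minimal sketch in plain Python (not edgeR code). The helper names `cpm` and `count_for_cpm` are made up for illustration, and the sketch ignores normalization factors, which edgeR's own cpm() can fold into an effective library size.

```python
def cpm(count, library_size):
    """Counts per million: reads for one gene per million reads in the sample."""
    return count * 1_000_000 / library_size


def count_for_cpm(cpm_value, library_size):
    """Raw read count that corresponds to a given CPM at this library size."""
    return cpm_value * library_size / 1_000_000


library_size = 100_000_000  # 100M reads, as in the question

print(count_for_cpm(1, library_size))  # 100.0 reads give CPM = 1 at 100M reads
print(cpm(100, library_size))          # 1.0
print(count_for_cpm(1, 20_000_000))    # 20.0 reads give CPM = 1 at 20M reads
```

The same CPM therefore means a different raw count in each sample, which is why a CPM threshold accounts for library size while a fixed count threshold does not.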
It has been taken care of in filterByExpr. The value of the min.count argument in filterByExpr (default 10) is first converted into a CPM and then used as a threshold.
As this is one of the most viewed questions, I guess most people have the same question as me.
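For readers who land here with the same question, the min.count conversion described in the answer above can be sketched roughly as follows. This is a simplified plain-Python illustration, not the edgeR implementation. It assumes, as I understand the defaults, that min.count is converted to a CPM cutoff using the median library size and that a minimum total count (15) is also required. It leaves out normalization factors and the adjustment edgeR makes for large groups. The function name `filter_by_expr_sketch` and the example counts are made up.

```python
from statistics import median


def filter_by_expr_sketch(counts, lib_sizes, min_sample_size,
                          min_count=10, min_total_count=15):
    """Return one True/False per gene (row of per-sample counts).

    A gene is kept if its CPM reaches the cutoff in at least
    `min_sample_size` samples and its total count is large enough.
    """
    # min_count is turned into a CPM cutoff at the median library size.
    cpm_cutoff = min_count / median(lib_sizes) * 1e6
    keep = []
    for row in counts:
        cpms = [c / n * 1e6 for c, n in zip(row, lib_sizes)]
        enough_samples = sum(v >= cpm_cutoff for v in cpms) >= min_sample_size
        enough_total = sum(row) >= min_total_count
        keep.append(enough_samples and enough_total)
    return keep


# Four samples, two per group, so the smallest group size is 2.
lib_sizes = [10_000_000, 20_000_000, 10_000_000, 20_000_000]
counts = [
    [0, 1, 0, 2],    # almost no reads anywhere
    [50, 80, 0, 0],  # expressed in two samples
    [30, 0, 0, 0],   # expressed in only one sample
    [8, 8, 8, 8],    # 8 reads clears the cutoff only in the 10M libraries
]

keep = filter_by_expr_sketch(counts, lib_sizes, min_sample_size=2)
print(keep)  # [False, True, False, True]
```

Note that the samples passing the cutoff do not have to come from the same group, which matches the user guide's statement that the filtering is independent of which sample belongs to which group.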