Hi,
I am analyzing an mRNA-Seq dataset using the edgeR package and testing filtering with filterByExpr and rowSums to decide which genes to keep. I have a question about interpreting the histogram of average log2 CPM in edgeR.
I tested filtering in 4 different ways and would like to know how to interpret the plot. Basically, filterByExpr looks good; however, I am also interested in creating a model.matrix on other variables such as treatment, severity, etc., for comparisons. How do I decide the cut-off for rowSums? What do the negative values on the x-axis signify? Should the graph look like a bell-shaped distribution?
Thank you in advance.
Best Regards, Toufiq
library(edgeR)
dge <- DGEList(counts = Counts, remove.zeros = TRUE)
dge$samples
# Either;
## filterByExpr
keep <- filterByExpr(dge, design)  ## Pairing and blocking are essential for the comparison, as different cells are extracted from the same subjects
table(keep)
## (OR)
## Filtering to remove low counts
keep <- rowSums(dge$counts) >= 10
## (OR)
## Filtering to remove low counts
keep <- rowSums(dge$counts) >= 50
dge <- dge[keep, , keep.lib.sizes=FALSE]
dge$counts
dim(dge$counts)
AveLogCPM <- aveLogCPM(dge)
hist(AveLogCPM)
Thank you,
Toufiq
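On the negative x-axis values asked about above, a rough sketch of the arithmetic may help. It ignores the small prior count that aveLogCPM adds before taking logs (prior.count = 2 by default) and the fact that aveLogCPM averages across samples through a fitted model rather than taking a plain mean. The symbols y_{gi} (count for gene g in sample i) and N_i (library size of sample i) are notation introduced here for illustration, not taken from the post.

```latex
\[
\mathrm{CPM}_{gi} = \frac{y_{gi}}{N_i} \times 10^{6},
\qquad
\log_2 \mathrm{CPM}_{gi} < 0
\;\Longleftrightarrow\;
\mathrm{CPM}_{gi} < 1
\;\Longleftrightarrow\;
y_{gi} < \frac{N_i}{10^{6}} .
\]
```

So a negative value simply means fewer than one read per million for that gene. For example, with a hypothetical library size of 20 million reads, any gene with a count below 20 in that sample would have a negative log2 CPM under this simplified formula.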
Hi Gordon Smyth, thank you for the details and suggestions.
rowSums(Counts) is easy to understand and execute. I like filtering with filterByExpr(y, design). The only doubt I have here is about the input design matrix. I used design_1 for estimateDisp and glmQLFit. It includes pairing and blocking, which is essential for the comparison because different cells are extracted from the same subjects. To filter by expression, should I use the below design_2?
Yes, it would be better to use design_2 for filtering even though the full matrix design_1 is used for the DE analysis. The reason is that Subject is just a blocking variable; the aim is not to compare the different Subjects to each other.
Gordon Smyth, noted.
Another question: in the same experiment I have a potentially interesting group comparison, Treatment, which I considered as independent, with 3 subjects without treatment (acting as baseline) and 3 patients with treatment. If I create another DGEList object to filter by expression and fit the model, I assume I could use the below:
You can't arbitrarily change design matrices for the same experiment. You can't include Cell_Type for one analysis and ignore it for another. The design matrix must always include all the important factors and groups. Anyway, I think I have already answered your original question about AveLogCPM histograms.
Gordon Smyth, sure, thank you.