Just follow up on my RNAseq analysis on the TCGA RSEM processed so-called raw counts using limma-voom package as Dr. Gordon suggested due to the issue of so-called RSEM raw counts but not true raw counts that edgeR required.
edgeR will work OK if you simply round the RSEM expected counts to integers before running edgeR. However voom is slightly preferable because it can use the fractional counts as they are.
(1.) I used isexpr <- rowSums(cpm(y)>1) >= 20 (y2 filtering) to do the filtering (based on one example in the package guide). I have 29 paired of matched tumor and normal samples (total 58 samples), is this reasonble to use for filtering, I asked this last time as well? I have a bit concern, since after filtering, the number of genes went from 20531 to 14735.
Why would this be a concern? One wouldn't expect the whole genome to be expressed at worthwhile levels in each particular cell type. Some genes are specific to cell types.
(2.) I did plot with plotMDS(y2,main="y2"); v <- voom(y2,design,plot=TRUE); plotMDS(v, top=50,main...). The plotMDS all show very nice separation of tumors vs normals in dim1.
I have analysed similar data from TCGA, and the separation is really much too good. Furthermore there is little evidence of matching. It would appear that the matched normal cells are not really comparable cells to the tumours. The matched normal samples may actually be profiles of whole blood, a cell type which has a radically different expression profile to epithelial cells. I really wonder what one can hope to learn about cancer by comparing epithelial tumours to blood.
However, the v <- voom(y2,design,plot=TRUE) plot slightly different between filtering with isexpr <- rowSums(cpm(y)>1) >= 20 vs filtering with y3<-y[rowSums(y$counts)>=50,]. The first filtering way seems got similar plot as the one on page 116 of the user guide, but the second way seems got bending toward to the down side around log2(count size +0.5) between 0 to 5, and the data points also shift to 0 side due to more of lower reads data points (the first option only have very few data points between 0 to 5 in x-axis). Does this suggest the first filtering way (y2) is better?
Yes, but one will get good results either way.
(3.) the above command show(summary(de <- decideTests(fit))); showed y2 filtering got 3917 +5392=9309 DEGs at FDR5% (7661 at FDR 1%), whereas the y3 filtering (similar commands as y2, not shown) got final 4917+6022=10939 DEGs at FDR5%, both filtering seem getting too many DEGs, any issues here?
This is what I would expect to get if I compared epithelial tumour cells to normal blood. If this is "too many DEGs", then the problem is with the data rather than the analysis.
(4.) using edgeR and the same data, I got more than 13k DEGs (total 18215 genes in the dataset after filtering) with FDR <0.05 (filtering like y3), now using limma voom, I still got quite a lot of DEGs (10939 DEGs at FDR5%) as shown in my question 3, although slight less than edgeR. Any potenital issue?
The edgeR analysis on fractional counts is not correct. edgeR is a likelihood based package. It evaluates the negative binomial likelihood, which is always exactly zero at non-integers. So there is no point in comparing the edgeR results to the voom results.
Best wishes Gordon
Postscript
Some years after this answer was posted, edgeR does now support fractional counts.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.