Dear All,
I'm working on a genomic plateform and I'm in charge of analysing biological data. A team of researchers asked me to compare two different methods to compute differentially expressed genes in a given dataset, from illumina human RNAseq data, aligned with Tophat2.
My dataset in composed of 12 samples, divided in 4 groups A, B, C and D (each of the groups containing 3 samples).
The first approach consists in using Cufflinks to compute FPKM values, calculate the mean FPKM value for each group, and then comparing the groups by performing a t-test (only if the coverage is > 1 in at least the 3 members of one of the 2 compared groups). So for each gene in each pairwise comparison, I get a p-value and a fold-change corresponding to the ratio of the means. A cutoff of p-value < 0.05 and FC > 1.5 was then applied.
The second approach consists in using the RUVseq method (http://www.bioconductor.org/packages/release/bioc/html/RUVSeq.html), based on a GLM approach. I'm using RUV-g with 11 housekeeping genes and I used the model :
set<-RUVg(set_raw_counts,row.names(controls),k=4)
where set_raw_counts is the SeqExpressionSet of the raw counts and controls contains housekeeping genes.
designAD<-model.matrix(~0+groups+W_1+W_2+W_3+W_4,data=pData(set))
yAD<-DGEList(counts=counts(set_raw_counts),group=groups)
yAD<-calcNormFactors(yAD,method="upperquartile")
yAD<-estimateGLMCommonDisp(yAD,designAD)
yAD<-estimateGLMTagwiseDisp(yAD,designAD)
fitAD<-glmFit(yAD,designAD)
Then for computing the DE genes in A vs B :
A_vs_B<-makeContrasts(groupsA-groupsB,levels=designAD)
myContrasts<-makeContrasts(A_vs_B=groupsA-groupsB,levels=designAD)
lrt<-glmLRT(fitAD,contrast=myContrasts[,"A_vs_B"])
But I'm not sure to understand well the output in lrt$table. I got 3 columns, logFC, logCPM and p-value.
I'm sorry this is a recurrent question but how is calculated the logFC and the logCPM ? Is it possible to have the details of the calculation ? How can I make it comparable with the FC I got from the first approach (with FPKM ?) Because when I tried to convert the logFC to FC, this lead to FC with very different orders of magnitude from the FPKM ones.
Thank you very much for your time and help.
By default, a prior count of 0.125 is added to all counts before the log-fold change is calculated; see the
glmFit
documentation for more details. In the case you've presented above, though, I wouldn't expect the prior count to be responsible for that result. I'd need more details to determine why that gene has such a large log-fold change. One possibility might be if there are very big differences in the library sizes between the samples in the two groups.Ok I'm not sure of what prior count really means... I need to have the 2 values that are used to calculate the logFC in my output but I just can't figured out how I can find them, any ideas ?
Thank you !
In the example you've given above, the "2 values" for all genes are in the columns of
fitAD$coefficients
with namesgroupsA
andgroupsB
, where each row corresponds to a gene. I can't imagine why you would need them, though.As to your other question, the prior count is just a count that is added to all other counts. This avoids getting, e.g., infinite fold changes if one of the groups has zero counts in all libraries.