Question

confusing output from EdgeR Glm

0

Entering edit mode

amoltej ▴ 10

@amoltej-7192

Last seen 7.6 years ago

Australia

Hi,

I am using following code to analyse two samples and differentially expressed gene in them. there are 3 replicates for each sample. after completing analysis I am seeing all the logFC values in negative, I don't know why? I actually went back and checked the raw count for them and they seem wrong. can you please tell me if I am doing anything wrong? also why pvalue is 0 (zero)? Thank you

here is the code that I am using -

#data_import
raw.data<-read.table("bcytvsgut.count",header=TRUE)

group<-c("Bcyt","Bcyt","Bcyt","mgut","mgut","mgut")
#make DGE list
cds<-DGEList(raw.data,group=group)
names(cds)
head(cds$counts)
cds$samples
#normalize data using TMM
cds <- calcNormFactors( cds )
d$samples
#data filtering
cps <- cpm(cds)
k <- rowSums(cps>=10)>3
d <- cds[k,]

#reset the library size
d$samples$lib.size <- colSums(d$counts)
cols <- as.numeric(d$samples$group)
par(mfrow=c(2,2))

plotMDS(d,col=cols)

#setup desing
design <- model.matrix(~0+group, data=d$samples)
design

d <- estimateGLMCommonDisp(d,design,verbose=T)
d <- estimateGLMTrendedDisp(d,design)
d <- estimateGLMTagwiseDisp(d,design)
plotBCV(d)

plotMeanVar(d,show.raw=TRUE, show.tagwise=TRUE, show.binned=TRUE)
#calculate differentially expressed genes
fit <- glmFit(d,design)
lrt <- glmLRT(fit)

topTags(lrt)
names(lrt)
lrt$comparison
head(lrt$table)
#export data
saveme<-topTags( lrt, n = Inf , sort.by = "none" )
write.table(saveme,file="EdgeR_bcytvsgut_hiseq.txt",sep="\t",quote=F)
plotSmear(d)
abline(h = c(-2, 2), col = "blue")

ambiguities that i am seeing between data -

Raw data -

Raw_data
G000241	1580	1726	1805	94	82	244
G000270	34849	34345	36117	50908	44831	46246
differential expression result
G000241	-16.2926	6.382334	1060.65	1.18E-232	1.20E-232
G000270	-7.90819	11.8414	5135.262	0	0

edger logfc pvalue • 2.1k views

ADD COMMENT • link updated 9.6 years ago by Aaron Lun ★ 28k • written 9.6 years ago by amoltej ▴ 10

0

Entering edit mode

Sorry the data doesn't make sense there - here is updated version of sample data

Raw_data
	Bcyt	Bcyt	Bcyt	Gut	Gut	Gut
G000241	1580	1726	1805	94	82	244
G000270	34849	34345	36117	50908	44831	46246
differential expression result
G000241	-16.2926	6.382334	1060.65	1.18E-232	1.20E-232
G000270	-7.90819	11.8414	5135.262	0	0

ADD REPLY • link 9.6 years ago amoltej ▴ 10

score 1 · Answer 1 · 2015-09-09

You're using a design matrix without an intercept. This means that the first column/coefficient represents the average log-expression across the "Bcyt" group, while the second column represents the average log-expression across the "mgut" group. By default, glmLRT will drop the last coefficient in the design matrix. This is equivalent to testing for whether the average log-expression across all "mgut" samples is equal to zero. This will not be true for the vast majority of genes, resulting in many genes with very low p-values and negative log-FC's (presumably, the average log-expression for these genes is less than zero when adjusted for library size).

I'm guessing that you actually want to test for differential expression between "Bcyt" and "mgut" groups. If so, you need to define a contrast vector. For simplicity, I'll redefine your design matrix here:

grouping <- factor(group)
design <- model.matrix(~0+grouping)
colnames(design) <- levels(grouping)

Your contrast vector can be defined with makeContrasts, as shown below:

con <- makeContrasts(Bcyt - mgut, levels=design)

This can be supplied to the contrast= argument of glmLRT to test for DE between groups. The log-fold changes in the resulting DE list represent the change in expression of the "Bcyt" group over the "mgut" group.