Question

Strange distribution of logFC

0

Entering edit mode

Jack ▴ 20

@jack-14069

Last seen 5.4 years ago

Dear all,

I have RNAseq data of two groups. After doing differential gene expression using edgeR and DESeq2(the results are similar). I found the distribution of logFC is a little strange, having two peaks in the distribution.

My guess would be that most genes are equally enriched in both goups, i.e. the mode of the logFC distribution (probably Gaussian-like) is around 0.

Perhaps someone can tell you why this is so.

Code can be seen bolow:

P1<-read.table("P1.STAR-counts-stranded.txt",header = TRUE)
row.names(P1) <- P1[,1]
P2<-read.table("P2.STAR-counts-stranded.txt",header = TRUE)
row.names(P2) <- P2[,1]
H3<-read.table("H3.STAR-counts-stranded.txt",header = TRUE)
row.names(H3) <- H3[,1]
H4<-read.table("H4.STAR-counts-stranded.txt",header = TRUE)
row.names(H4) <- H4[,1]
LIST<-list(P1,P2,H3,H4)
COL<-unique(unlist(lapply(LIST, colnames)))
ROW<-unique(unlist(lapply(LIST, rownames)))

TOTAL<-matrix(data=NA,nrow=length(ROW), ncol = length(COL),dimnames = list(ROW,COL))

for (ls in LIST){
  TOTAL[rownames(ls),colnames(ls)]<-as.matrix(ls)
}
write.csv(TOTAL,file="/home/Claire/Desktop/data/STAR_6/TOTAL-PH-STAR-counts-stranded.csv")

#############################edgeR#####################################
setwd("/home/Claire/Desktop/data/STAR_6/")
library(edgeR)
HPcount<-read.csv("TOTAL-PH-STAR-counts-stranded.csv",header = TRUE,sep = ",")
HPcounts<-HPcount[,8:11]
samplenames<-c("P1","P2","H3","H4")
HPgeneid<-HPcount[,2]
rownames(HPcounts)<-HPgeneid
colnames(HPcounts) <- samplenames
HPgenes<-HPcount[,2:7]
group <- as.factor(c("P", "P", "H","H"))
DGEList <- DGEList(counts=HPcounts, group=group,genes=HPgenes)
DGEList$samples
keep <- rowSums(cpm(DGEList)>0.5) >= 2
DGEList_keep <- DGEList[keep , keep.lib.sizes=FALSE]
DGEList_keep_norm <- calcNormFactors(DGEList_keep)
#design matrix
design<-model.matrix(~0+group)
colnames(design)<-levels(group)
PvsH<-makeContrasts(P-H, levels = design)

#Estimating the dispersion
y<-estimateDisp(DGEList_keep_norm,design)
fit<-glmFit(y,design)
#to perform likelihood ration tests
lrt<-glmLRT(fit,contrast = PvsH)
lrt
###############################hist and volcanno plot#########
volcanoData <- cbind(lrt$table$logFC, -log10(lrt$table$PValue))
colnames(volcanoData) <- c("logFC", "negLogPval")
head(volcanoData)

plot(volcanoData, pch=19)

hist(lrt$table$logFC,breaks=100)

edger deseq2 rnaseq differential gene expression • 1.9k views

ADD COMMENT • link updated 7.4 years ago by Aaron Lun ★ 28k • written 7.4 years ago by Jack ▴ 20

0

Entering edit mode

Where is the code you used to obtain the log-fold changes and to generate the plots?

ADD REPLY • link 7.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for your reminding. The codes have been added in the question.

ADD REPLY • link 7.4 years ago Jack ▴ 20

0

Entering edit mode

Are you sure this is right? You construct design twice but you have no command to construct PvsH.

ADD REPLY • link 7.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Yes, I contructed PvsH, sorry about the negligence. This time it is right

ADD REPLY • link 7.4 years ago Jack ▴ 20

0

Entering edit mode

Well, it can't be right, because you have:

design<-model.matrix(~0+group)
PvsH<-makeContrasts(P-H, levels = design)
colnames(design)<-levels(group)

... which doesn't make sense, because the column names from model.matrix should be groupP and groupH before you rename them. So the call to makeContrasts at this location cannot be correct.

Please try running your own code in a fresh session, i.e., with no saved workspaces. Given all these errors, it is difficult to trust that the code you have shown is actually generating the output that you claim it does.

ADD REPLY • link 7.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

I wanted to simplify the codes, which made it worse, sorry about that... This is the code I rerun with no saved workspaces, without any changes. It creates the exact the same outputs.

ADD REPLY • link 7.4 years ago Jack ▴ 20

score 1 · Answer 1 · 2017-11-14

You have a very strange way of constructing your CSV file; I will just assume that HPcounts actually contains the gene-wise read counts, and not some strange coercion of gene names into integers.

If this is the case, then your log-fold change histogram simply indicates that you have lots of DE genes in both directions. The surprising aspect is not the fact that you have two modes - after all, there has to be a mode somewhere for the up- and down-regulated genes - but that there are enough DE genes on either side for the modes to be visible in the distribution of log-fold changes from all genes.

Having lots of DE genes may compromise normalization, because we need to assume that most genes are not DE in order to correct for composition biases. More than 30% of genes being DE in both directions will start to cause issues with calcNormFactors. However, this shouldn't affect the significance of the top DE genes, where the log-fold changes should be so large that any inaccuracy should be negligible.