Hi, I am doing a paired analysis between 24 pairs of Normal and Diseased samples, using edgeR. I take the entries with FDR < 0.05. Out of a total of around 57700 genes (including the multiple isoforms), I have obtained around 11500 genes within my selected FDR. However, among these, I find that about 8000 are up-regulated in diseased condition. This shows that my data is biased towards a particular set. Furthermore, when looking at the log(fc), the upregulated genes have a max value of 7, while in case of downregulated genes, there are only a couple with values 4, and the majority around 2 and below.
I would also like to state here, that the genes that I have found differential (both up and down) do indeed seem to be correct, as there are journals in which the top differential genes (in both categories) have been reported.
I have noticed similar results when doing the analysis between various sets. In each case, my upregulated genes / transcripts/miRNAs dominate.
I would like some clarification as to whether I am doing something wrong or what the reason possibly could be for this.
I am providing the script I have run below, as well as few lines of my input file: (I have considered the genes which have a minimum of 4 reads, in either all normal or all diseased samples)
Input file:
gene_name gene_type length R1_R1_71N R2_R1_71D R3_R1_79N R4_R1_79D R5_R1_85N R6_R1_85D ...
A1BG protein_coding 8321 113 106 96 71 97 73
A1BG-AS1 antisense 7432 115 106 95 85 90 79
Script:
raw_counts <- read.table(file.choose(),header = TRUE, sep = "\t")
head(raw_counts)
group <- factor(c("56","56","43","43","62","62","134","134","135","135","58","58","67","67","59","59","48","48","159","159","127","127","76","76"..total 24 pairs))
condition <- c("N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D","N","D"...total 24 pairs)
y <- DGEList(counts = raw_counts[,4:51], genes = raw_counts[,1:3], remove.zeros = TRUE)
dim(y)
y_norm <- calcNormFactors(y)
keep <-rowSums(y$counts)>= 96
table(keep)
y_norm <- y_norm[keep, ,keep.lib.sizes=FALSE]
dim(y_norm)
y_norm
df <- data.frame(Sample = colnames(y_norm),group,condition)
df
design <- model.matrix(~group + condition)
design
rownames(design) = colnames(y_norm)
design
y_norm <- estimateDisp(y_norm,design,robust=TRUE)
y_norm$common.dispersion
fit <- glmQLFit(y_norm, design)
lrt <- glmQLFTest(fit)
result <- topTags(lrt, n=197684, adjust.method="BH", sort.by="none", p.value=1)
rpkm <- rpkm(y_norm)
What makes you say that there are more genes up-regulated in the disease state than down-regulated? You don't show any results that would suggest that.
It's certainly not true that edgeR tends to give positive logFCs regardless of what is tested. The signs of the logFCs will in fact just reverse if you change the order of the levels of the condition factor.
What I meant was that from the results, out of all the genes with FDR < 0.05, majority genes had a negative value of log(fc), which in this case means upregulated in disease.
Very few had positive values of log(fc). (Downregulated)
I have provided the number of total differential genes, and the no of upregulated and downregulated genes, at the start of my question.
Unfortunately I cannot disclose the actual results here.