Hi,
Time to tackle an amateur's problem! Ha ha! I have 6 samples/ groups with 3 replicates (In total 17; one sample contains 2 replicates). the sample names are Mf, Bf, Ef, Mr, Br, and Er. I want to get DEGs (differentially expressed genes) between Mr-Mf, Br-Bf, Er-Ef.
For doing that, I subsetted my data keeping either Mr-Mf, Br-Bf, or Er-Ef. And then I got DEGs for three different contrasts. But subsetting the data has an impact on dispersion estimation.
Now I would like not to subset my data (keeping all my data for dispersion estimation). And after estimating dispersion, I want to have DEGs for those three different contrasts. The order I want follow is - after importing my data > dispersion estimation > getting DEGs for three different contrasts in three different file. But I can not do that.
Could you please help me modifying my script so that I can follow my oder of analysis? Note: I do not want to use normalization steps from edgeR, since I already normalized my data.
Thank you in advance for dedicating your time to resolve me problem!
Code should be placed in three backticks as shown below
library(limma)
data <- read.csv("normalized_file_for_mesophyll_&_epidermis.csv", header = TRUE, row.names = 1)
# Since the data is already normalized, there is no specific step for normalization
# Extract Mf and Mr samples.
Mf_samples <- colnames(data)[grep("^Mf", colnames(data))]
Mr_samples <- colnames(data)[grep("^Mr", colnames(data))]
# Subset data to keep only Mf and Mr samples. But now I do not want to subset my data. I want to have all the data.
subset_data <- data[, c(Mf_samples, Mr_samples)]
# Create DGEList object
dge_data <- DGEList(counts = subset_data)
# Assign group information to samples
sample_groups <- factor(c(rep("Mf", length(Mf_samples)), rep("Mr", length(Mr_samples))))
# Perform normalization
# dge_data <- calcNormFactors(dge_data)
# Perform differential expression analysis
design_matrix <- model.matrix(~0 + sample_groups) #to make a design matrix first
colnames(design_matrix) <- gsub("^sample_groups", "", colnames(design_matrix))
# To creates a contrast matrix (cm) comparing the expression levels between Mr and Mf
cm <- makeContrasts(Mr-Mf, levels=design_matrix)
dge_data <- estimateDisp(dge_data, design = design_matrix)
dge_data <- estimateDisp(dge_data, design_matrix, robust=T)
# To create biological coefficient of variation
plotBCV(dge_data)
fit.dge <- glmFit(dge_data, design_matrix)
# perform gene selection
max.qval <- 0.05
min.FC <- log2(2)
min.count <- 96
comparison <- "Mr - Mf"
lrt <- glmLRT(fit.dge, contrast=cm[,comparison])
sel.r <- topTags(lrt, n=nrow(dge_data$counts))
med.expr <- apply(dge_data$counts[rownames(sel.r),],1,median)
sel.r <- sel.r[sel.r$table$FDR<max.qval & abs(sel.r$table$logFC)>min.FC & med.expr>min.count,]
sel.rows <- rownames(sel.r) #to get name of the genes that pass the criteria
# To get the FDR, logFC, logCPM of DEGs
DEGs <- sel.r$table[sel.rows, ]
# Save the results to a file
write.csv(DEGs, "DEGs_of_mesophyll.csv")
Hello Gordon,
Thanks for your response. By normalization, I meant that I have already removed lowly expressed genes from my count matrix. I was trying to follow one of the user guide case studies, but I was confused about how can I make my design matrix and DGEList while not using a subset of my data.
Filtering (removing lowly expressed genes) is separate to normalization.
It isn't clear why not subsetting would cause any difficulty. It would be best for you to try to follow the documentation.