is our edgeR code doing what we think it does?
2
1
Entering edit mode
acs1990 ▴ 10
@acs1990-21798
Last seen 5.0 years ago

Thanks in advance for any help someone more experienced with edgeR and DE analysis can offer.

I am new in a lab doing this kind of work and I have only been working in R for about a month. My team tells me this code performs differential expression analysis of two groups via a t-test and edgeR's Exact Test. We also report both sets of results in presentations of the analysis. This was my first concern. In the interest of responsible analysis I compared our code to the edgeR documentation and vignettes and I noticed a handful of what I would consider serious divergences from the suggested workflow. I am worried we are not doing the analysis we think we are doing or at least we are not doing it in the most efficient or statistically responsible way. Am I on track by being concerned or is this some other equally valid way to do this type of analysis?

Would anyone be willing to do me a serious solid and check out the code and let me know where you see problems? Thanks much!

selected_samples=c("Abc_04","Abc_05","Abc_06","Abc_07","Abc_08","Abc_09")
metafile=("metadata.txt")
sample_list <- read.table(metafile, header=TRUE, stringsAsFactors=FALSE)
samples <- sample_list[order(sample_list$Sample_ID),]

#sort by ID to organize data
#import counts from CountMerged.txt data from Step3-2
counts_file=("CountMerged.txt")
counts <- read.table(counts_file, header=TRUE, sep ="\t", stringsAsFactors = FALSE)

#filter "counts" and "samples" to match properly
counts <- counts[,c("id",selected_samples)]
samples=subset(samples, Sample_ID %in% selected_samples)

row.names(counts) <- counts$id
counts$id <- NULL
colnames(counts) <- samples$Sample_ID

#gene filtering only keep samples with >=10 counts in about 1/3 of the data
keep <- rowSums(counts >= 10) >= 3
counts <- counts[keep,]

#Function definition to be called later
tmm <- function(counts, groups=NA){
  d <- DGEList(counts=counts, group=factor(groups))
  d <- calcNormFactors(d, method="TMM") 
  return(d)
}

#normalize
dgList = tmm(counts, samples$Group)

#Reset library sizes by multiplying lib.size and norm.factors
dgList$samples$lib.size <- colSums(dgList$counts)


samples[samples$Group == "CONTROL","dm"] <- "1"
samples[samples$Group == "CASE","dm"] <- "2"
samples$dm

#create a matrix of data to do more testing and apply 
#Fisher's exact test on
design <- model.matrix(~0+samples$dm)
colnames(design) <- levels(samples$dm)
design

#dg will be our dispersion estimate on the data set is the exact test calling
dg <- estimateDisp(dgList,design)
et <- exactTest(dg, pair=2:1)

#capture results in a variable
results_edgeR <- topTags(et, n = nrow(dgList$counts), sort.by = "none")

#CPM, Count Per Million
#log2 - transform the data and convert to CPM
d1 <- cpm(dgList, normalized.lib.sizes=TRUE,log=TRUE)

#Group1 sample ids (control should go here if using)
ctrl <- samples[samples$Group == "CONTROL","Sample_ID"]

#Group2 sample ids (case should go here if only case/control)
case <- samples[samples$Group == "CASE","Sample_ID"]

#T-test.
test_fun <- function(x) {
  t.test(x[ctrl], x[case], alternative = c("two.sided"), paired = FALSE, var.equal=TRUE)$p.value
}

#call T-test function
all.p.genes = apply(d1, 1, test_fun)

#calculate FDR (adjusted p value in this case)
fdr.value = p.adjust(all.p.genes, method="fdr")

#Fold Change calculation
ctrl_mean <- rowMeans(d1[,ctrl])
grp_mean <- rowMeans(d1[,case])
fc.value <-  grp_mean-ctrl_mean

#Combine results 
fc.plus.pvals = cbind(fc.value,all.p.genes, fdr.value)
edger differential expression rna-seq • 1.5k views
ADD COMMENT
2
Entering edit mode
swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 1 day ago
San Diego

Doing T-tests on logged CPM data...I guess if you really have nothing but Excel, it will give you an idea..but this is not how you are supposed to do it. You can't publish based on results like this. You should follow what the EdgeR tutorials say, and let that software's algorithms do their stuff. You give EdgeR raw counts, it does its normalization, and it calculates fold changes and p-values.

Here's a hint as to why T-test aren't right...The t-test result between (1,2,3) and (4,5,6) is the same as (10,20,30) and (40,50,60)...but in real life, with those raw counts, the latter is way more meaningful.

CPM normalization has its uses for visualization...but it's not appropriate for calculating DE genes.

Since you have only two categories, it looks like the exact test is appropriate. Your script should just stop at that point, and not bother with t-test on normalized counts.

ADD COMMENT
0
Entering edit mode

Why is making the DGEList and calculating size factors in its own function? dgList$samples$lib.size <- colSums(dgList$counts) doesn't seem to do anything. I don't think the code after is doing much either.

ADD REPLY
0
Entering edit mode

Thank you for your feedback, I appreciate you taking the time to review this. You have absolutely confirmed my worst fears and it's super helpful. I feel better since you confirmed some of what I noticed too and I am working on overhauling our workflow to adhere to the documentation.

ADD REPLY
0
Entering edit mode

Thank you for your help on this. I am in the process of addressing this with my team and making sure we are doing it right.

To clarify, you should log transform the CPM for visualization only and not for analysis and only use the raw counts for analysis steps like the exactTest and QLFtest?

It is unclear from the User Guide since the Guide suggests filtering on CPM but doesn't show transforming the raw counts to CPM anywhere. The results of the QLF test also show logCPM. None of the examples show raw counts being transformed to CPM prior to filtering or normalization so my question is, should counts be transformed to CPM before creation of the DGEList object or before analysis or is this transformation performed as part of the function? Thanks again.

ADD REPLY
2
Entering edit mode
Yunshun Chen ▴ 900
@yunshun-chen-5451
Last seen 5 days ago
Australia

You should always refer to the edgeR user's guide. There are several detailed case studies in the user's guide with complete R code and explanation which you can easily follow.

We usually recommend the quasi-likelihood approach for differential expression analysis as we found it gives better FDR control.

ADD COMMENT
0
Entering edit mode

Thank you so much. I really appreciate how detailed and clear the documentation is, especially the case studies. Even as an R noob I have been able to follow it very easily and the QL approach was the conclusion I came to as well. I am in the process of overhauling our workflow so that it more closely adheres to the documentation.

ADD REPLY

Login before adding your answer.

Traffic: 867 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6