Hi,
I have two questions regarding the current dataset (output is coming from htseq) I am analyzing (I included the R code below). My experiment has two-time points (2hrs and 5hrs) and for each time point, there are two conditions (mock and treated). I want to compare the differential expression patterns compared to mock between different time points.
- Currently, I am doing the analysis for each time point and comparing between time points separately. Is there a way to combine both into a single run (Please note that I have mocks for both time points)?
- This may be a stupid question. When I defined the class I put TM (treated) first. The question is when I put TM first the output I am getting is compared to mock (reference) right? not the other way around. I always get scared thinking I am doing the analysis the opposite way.
Thank you in advance!
###R code###
# Read in count matrix
rawdata=read.table("/scratch/user/UPR/RNA_allign/RNA_ALIGN/htseq_counts_clean/gene_read_counts_table_all_2hr_final.tsv", header=TRUE, stringsAsFactors=FALSE, row.names=1)
# Check dimensions
dim(rawdata)
# Require at least 1/6 of samples to have expressed count >= 10
sample_cutoff <- (1/6)
count_cutoff <- 10
#Define a function to calculate the fraction of values expressed above the count cutoff
getFE <- function(data,count_cutoff){
FE <- (sum(data >= count_cutoff)/length(data))
return(FE)
}
#Apply the function to all genes, and filter out genes not meeting the sample cutoff
fraction_expressed <- apply(rawdata, 1, getFE, count_cutoff)
keep <- which(fraction_expressed >= sample_cutoff)
rawdata <- rawdata[keep,]
# Check dimensions again to see effect of filtering
dim(rawdata)
#################
# Running edgeR #
#################
# load edgeR
library('edgeR')
# make class labels
class <- c( rep("TM",3), rep("Mock",3) )
#Get common gene names
#Gene=rownames(rawdata)
#Symbol=mapping[Gene,1]
#gene_annotations=cbind(Gene,Symbol)
# Make DGEList object #y <- DGEList(counts=rawdata, genes=gene_annotations, group=class) genes component is removed
y <- DGEList(counts=rawdata, group=class)
nrow(y)
# TMM Normalization
y <- calcNormFactors(y)
# Estimate dispersion
y <- estimateCommonDisp(y, verbose=TRUE)
y <- estimateTagwiseDisp(y)
# Differential expression test
et <- exactTest(y)
# Extract raw counts to add back onto DE results
counts <- getCounts(y)
# Print number of up/down significant genes at FDR = 0.05 significance level
summary(de <- decideTestsDGE(et, adjust.method="BH", p=.05))
#Get output with BH-adjusted FDR values - all genes, any p-value, unsorted
out <- topTags(et, n = "Inf", adjust.method="BH", sort.by="none", p.value=1)$table
#Add raw counts back onto results for convenience (make sure sort and total number of elements allows proper join)
out2 <- cbind(out, counts)
#Limit to significantly DE genes
out3 <- out2[as.logical(de),]
# Order by p-value
o <- order(et$table$PValue[as.logical(de)],decreasing=FALSE)
out4 <- out3[o,]
# Save table
write.table(out4, file="DE_genes_2hr.txt", quote=FALSE, row.names=TRUE, sep="\t")
Thank you, James 🙏. I was focusing on case studies and somehow missed that. I will get back if I come across any issues.
Hi James, I did my analysis as you & Gordon Smyth suggested using the following script. Thank you again. I have a small question regarding the identification of genes among both time points. After analyzing each time point (at p=0.05, logFC >= 1), I already know that there are 42 common genes. My question is why It didn't work when I tried below (was following page 38). It ended up giving the error "incorrect number of dimensions". (Only difference associated with mine compared to the example is that I don't use 0 hr mock/treatment).
I really appreciate it if you can shed some insight about it.