Hello,
I have made great use of your DESeq2 package and grateful for it. I have recently normalized using conditional quantile normalization to remove gene length bias. This CQN normalization gives me log expression data that includes negatives. I am unable to perform the DESeqDataSetFromMatrix function to begin identifying differentially expressed genes. Is there a way to identify differentially expressed genes using log expressed data with negative values in DESeq2? Or is there an output file or step in the CQN normalization that has gene length adjusted count data that can be fed into the DESeqDataSetFromMatrix function?
Thank you for your time to help me understand this.
My original cqn code,
cqnres <- cqn(counts = counts,x = df.subset$GC,lengths = df.subset$length) # cqn normalization
CQNnorm <- cqnres$y + cqnres$offset # values are in log2
I have managed to remove the negative log values by using:
cqnOffset <- cqn_res$glm.offset
cqnNormFactors <- exp(cqnOffset)
as provided by the DESeq2 vignette. However, DESeq2 only seems to like integers, so I am working to figure that out.
You should provide integers as counts and the offsets as directed in the vignette.
I have looked at the vignette for both scripts, which is how I received the code to remove the negatives above, which was a helpful step. I still have an issue that I do not see being addressed in the vignettes and hoping to get clarification from more experienced users. From my understanding, CQN output is log-transformed and not as an integer. So if CQN output is ONLY log data and DESeq2 ONLY accepts non-negative integers, how can CQN data be performed and taken into consideration when running DESeq2? If this answer is in the vignettes, I would benefit from anyone's time to make that clear to me so I can have confidence in my next steps.
The handoff between the two packages is to use DESeq2 with the original counts and to supply the offset matrix calculated by CQN as a normalizationFactor for the dds object. This is the paradigm we have in the vignette. Is there a remaining question?
Okay, I am putting my code here for anyone who comes across this and has a similar issue. Perhaps it can offer guidance. It may not be pretty code, but she runs. Certainly, consider the DESeq2 and CQN vignettes first.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("cqn") library(cqn) library(scales)
counts <- read.csv("cavernosacount.csv", header = TRUE, stringsAsFactors = FALSE) #raw counts from species length <- read.csv("Mcavtranscriptlength.csv", header = TRUE, stringsAsFactors = FALSE) #transcript lengths of each comp(gene) gc <- read.csv("McavGC.csv", header = TRUE, stringsAsFactors = FALSE) #gc percentage as a decimal for each comp df <- merge(gc,length,by="comp") df.subset <- merge(df,counts, by="comp") #merge all to put counts, length, and gc in the same order and to subset and assign gc, length and count to comps. df.subset <- df.subset[ !duplicated(df.subset$comp), ] #remove duplicates counts <- df.subset[c(4:13)] #take only counts
Code from the author, edited to our files.
Examine relationship between FC and Tx length for biological conditions after CQN normalization with GC and Length
cqnres <- cqn(counts = counts,x = df.subset$GC,lengths = df.subset$length) # cqn normalization
CQNnorm <- cqnres$y + cqnres$offset # values are in log2 cqnplot(cqn_res, n = 2) #See how the systematic effect(length or GC) influences the LFC.the longer the gene, or higher the GC, the higher the LFC.
make the CQN normalization appropriate for DESeq2
cqnOffset <- cqnres$glm.offset cqnNormFactors <- exp(cqnOffset) CQNnormcomp <- cbind(cqnNormFactors,df.subset$comp)#annotate comp ID's to normalized count matrix colnames(CQNnormcomp)[colnames(CQNnorm_comp)==""] <- "comp" #change column label
Need to remove duplicates for DESEq
master <- read.csv("cavernosaMaster.csv") a <- merge(CQNnormcomp,master,by="comp") aev <- a[order(a$comp, abs(a$evalue) ), ] ### sort first aev <- a[ !duplicated(a$comp), ] ### Keep lowest evalue CQNnormcompdeseqinput <- aev[c(1:10)] #keep only CQN normalized data and contig ID CQNrawcount <- aev[c(1,15:24)] #keep raw counts that survived the CQN filtration
now for DESeq2
if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install() library(DESeq2) -------CQN integration-------- countData <- as.matrix(read.csv("CQNrawcount.csv", row.names="comp")) colData <- read.csv("Mcavlabel.csv")
set the design## Creates matrix by mergeing treatment to replicate id and it's HTseq read count
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~Treatment)
to integrate CQN Normalization:
normFactors <- cqnNormFactors / exp(rowMeans(log(cqnNormFactors))) normalizationFactors(dds) <- cqnNormFactors dds <- dds[ rowMeans(counts(dds)) > 10, ] #filter, #previously dds test <-DESeq(dds) #previously dds resWPvCD <- results(test,contrast=c("Treatment", "Disease", "Control")) head(resWPvCD) resorderdWPvCD <-resWPvCD[order(resWPvCD$padj),] head(resorderdWPvCD, 12) summary(resWPvCD) write.csv(resorderdWPvCD, file = "McavCQNWPvCD.csv")