Hi all, I'm trying to normalise my bulk RNA-seq dataset for both GC-content and gene-length, as both are causing bias in my data. I have worked through the EDASeq vignette, which shows how to correct for one of these at a time, but not both simultaneously. The code I have used to normalise for GC-content alone is below. Having read through the methodology used by EDASeq, I can't see any reason why I couldn't do this sequentially, GC-content first and then gene-length, but it would be great to hear whether others agree this is acceptable, or whether anyone has experience of doing this. I have also included some code below to show how I would do it.
# Normalise for GC-content:
library(EDASeq)
data <- newSeqExpressionSet(counts = cm,
                            featureData = features,
                            phenoData = md)
dataWithin20 <- withinLaneNormalization(data, "gc", which = "full", num.bins = 20)
dataNorm20 <- betweenLaneNormalization(dataWithin20, which = "full", round = TRUE)
# Second, suggested steps to normalise for gene-length:
# STEP 1 - generate a new SeqExpressionSet using the normalised counts from above as if they were unadjusted counts:
normCounts20 <- normCounts(dataNorm20)
data2 <- newSeqExpressionSet(counts = normCounts20,
                             featureData = features,
                             phenoData = md)
# STEP 2 - re-run EDASeq for length normalisation
# (stored under a new name so the GC-only result above is not overwritten):
dataWithin20.2 <- withinLaneNormalization(data2, "length", which = "full", num.bins = 20)
dataNorm20.2 <- betweenLaneNormalization(dataWithin20.2, which = "full", round = TRUE)
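One way to check whether a sequential correction like the one above has actually removed both biases is to look at how strongly each sample's expression still tracks GC-content and gene-length afterwards. The sketch below is only an illustration in base R: the helper `bias_cor()` and the toy matrix are hypothetical and not part of EDASeq, and the toy counts are simulated with a GC effect injected into two samples purely to show what the output looks like.

```r
# Hypothetical helper (not an EDASeq function): per-sample Spearman correlation
# between log2(count + 1) and a per-gene covariate such as GC-content or length.
bias_cor <- function(mat, covariate) {
  stopifnot(is.matrix(mat), nrow(mat) == length(covariate))
  apply(log2(mat + 1), 2, function(x) cor(x, covariate, method = "spearman"))
}

# Toy data only: 200 simulated genes in 4 samples, constant baseline expression.
set.seed(1)
n_genes    <- 200
gc_content <- runif(n_genes, 0.3, 0.7)          # per-gene GC fraction
gene_len   <- round(exp(runif(n_genes, 6, 9)))  # per-gene length in bp

# S1 and S2 have no GC effect; S3 and S4 have a strong one (slope 4 on the log scale).
gc_slope <- c(S1 = 0, S2 = 0, S3 = 4, S4 = 4)
toy <- sapply(gc_slope, function(b) {
  rpois(n_genes, lambda = 50 * exp(b * (gc_content - 0.5)))
})

gc_rho  <- bias_cor(toy, gc_content)     # near 0 for S1/S2, clearly positive for S3/S4
len_rho <- bias_cor(toy, log(gene_len))  # near 0 everywhere (no length effect simulated)

print(round(gc_rho, 2))
print(round(len_rho, 2))
```

With real data, the same helper could be run on `counts(data)` before normalisation and on `normCounts()` of the final object afterwards, using the `gc` and `length` columns of `features` as covariates. If the length correction had reintroduced some GC dependence, it should show up as GC correlations drifting away from zero again. EDASeq's own `biasPlot()` gives a visual version of the same check.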
Of note, CQN offers the option to do both GC-content and gene-length normalisation at the same time, but I would be keen to do this with EDASeq.
Any help is much appreciated!
Sam
What is the end goal of the analysis? What you are doing here is quite uncommon, if you ask me. Preprocessing tools such as salmon do a good job of removing sample-specific bias, and correcting for GC content directly is relatively uncommon.
This is a great question. I am performing an unsupervised clustering analysis on a large RNA-seq dataset, which combines samples from multiple experiments. We have found a significant batch effect, which we have traced to some element of the library preparation phase of the experiments. GC-content bias differs significantly between the batches, and it is visible at both the sample level and the batch level. I am exploring multiple ways of addressing this issue, including supervised batch-correction approaches (e.g. ComBat-seq), but also explicitly correcting for GC-content bias, in the hope that this will remove some of the batch effect.
I have not explored alterations to the alignment and quantification stages of the pipeline, but this is something I am also planning on addressing.
Can the batch effect even be corrected meaningfully? That is, is every experimental group represented in each batch? GC bias is the least of your problems in a confounded experiment. Clustering samples across many experiments is, in my opinion, not really meaningful.
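The question of whether every experimental group is represented in each batch can be answered with a simple cross-tabulation of the sample metadata. This is a minimal sketch in base R: the sample sheet `md_toy` and its column names `batch` and `group` are hypothetical placeholders, to be swapped for whatever the real metadata uses.

```r
# Hypothetical sample sheet; replace with the real metadata table.
md_toy <- data.frame(
  batch = c("A", "A", "A", "B", "B", "B", "C", "C"),
  group = c("ctrl", "treat", "treat", "ctrl", "ctrl", "treat", "treat", "treat")
)

# Rows are batches, columns are experimental groups, cells are sample counts.
design_tab <- table(batch = md_toy$batch, group = md_toy$group)
print(design_tab)

# Batches that are missing at least one experimental group.
missing_in_batch <- rownames(design_tab)[rowSums(design_tab == 0) > 0]

# Groups that occur in only one batch, i.e. fully confounded with that batch.
single_batch_groups <- colnames(design_tab)[colSums(design_tab > 0) == 1]

print(missing_in_batch)
print(single_batch_groups)
```

Empty cells in the table mark batch and group combinations with no samples. A group that appears in only one batch cannot be separated from that batch's technical effect by any correction method.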
Thanks for your thoughts. I get your concern about batches being corrected, but disagree with the opinion that meaningful comparisons cannot be made. There is definitely evidence to show that correcting for technical covariates can help with extracting the meaningful information within the data, particularly when you have experimental groups represented well within each batch. It would be a shame to throw away so much data for this reason - if results can be externally validated then this essentially provides evidence that these results have meaning.