Hello everyone
I have multiple bulk RNA-seq datasets that I need to apply the same pipe line on. I want to normalize them from counts data to TPM. In all datasets, I have the genes as rows, and samples as columns.
Unfortunately, I don't have the fastq files, all I have is the counts data and from there I need to get to TPM, so I cant use salmon, kallisto or any pckage that uses fastq files for that matter. To normalize, I need the length of the genes.
This is my code to get the lengths of the genes:
exons = exonsBy(EnsDb.Hsapiens.v86, by="gene")
exons = reduce(exons)
len = sum(width(exons))
insect = intersect(rownames(counts_df),names(len))
geneLengths = len[insect]
counts_df = counts_df[insect,]
And then I use the geneLengths
in the normalization process:
rpkm <- apply(X = subset(counts_df),
MARGIN = 2,
FUN = function(x) {
10^9 * x / geneLengths / sum(as.numeric(x))
})
TPM <- apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6) %>% as.data.frame()
However, this gives me a long list of warnings like this:
Warning messages:
1: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
2: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
3: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
4: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
5: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
6: In 10^9 * x/genelength :
longer object length is not a multiple of shorter object length
I have been told that edgeR
has the ability to normalize, so I tried this also:
y <- DGEList(counts=counts_df , genes=data.frame(Length=geneLengths))
y <- calcNormFactors(y)
RPKM <- rpkm(y)
But this crashed pretty quickly, might be due to the big number of genes in a dataframe (more than 20,000 genes).
I have also been told the Deseq2
can normalize the data, but as far as I know Deseq2
normalizes in it's own way. Can Deseq2 nromalize to TPM data?
If not, can you guys please guide me in the normalization process and the code for that in r? there seems to be a bit of mess because without fastq files I can't know which isoform of each gene is being calculated.
Neither edgeR nor DESeq2 output TPM directly. TPM is easy to calculate though if counts and appropriate length info is available (DESeq2: Is it possible to convert read counts to expression values via TPM and return these values?). Gene length is not so trivial to get though (how to calculate gene length to be used in rpkm() in edgeR) but people told you about this already over at biostars (https://www.biostars.org/p/9534546/).