Question

How to normalize Kallisto count matrix collapsed at gene-level with TMM/DESeq2?

BioNovice247 • 0

Hi all,

Sorry for the simple question but I'm new to Kallisto and could not reach a definitive conclusion based on other posts.

I have two matrices of expression data (from a previous study) that I want to use for mRNA-miRNA correlation analysis. According to the authors, the mRNA expression matrix was derived using Kallisto. They did not mention tximport or any other tool, but the read counts are already collapsed to gene level: the matrix has gene names in the first column (e.g. TP53, A2M), genomic coordinates in the second column (e.g. "chr17_7661779_7687550_-"), and raw counts in the remaining columns (mostly non-integer values such as 1192.75812 and 2546.19874).

My question is how to normalize this data for correlation analysis. From what I have read, the best normalization methods for correlation analysis are edgeR's TMM and DESeq2 normalization. Based on other posts, I think the following scenarios would work, but I would appreciate it if someone could confirm this and correct my mistakes:

For edgeR: apparently, edgeR can handle non-integer values, and I think no prior data transformation is needed since the counts are already collapsed to the gene level. So simply using the following code should prepare the matrix for correlation analysis:

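library(edgeR)
dgelist <- DGEList(counts = matrix)  ## the matrix with gene names as row names and counts in columns
dgelist <- calcNormFactors(dgelist, method = "TMM")  ## adds TMM normalization factors to the DGEList
## note: a name like norm-mat is parsed by R as a subtraction, so an underscore is used instead
norm_mat <- cpm(dgelist, log = TRUE)  ## TMM-normalized log2-CPM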

For DESeq2: DESeq2 requires integers as input. Others suggest using the function DESeqDataSetFromTximport(), but I believe this works on the output of tximport, not on a matrix of counts. Apparently, this function also accounts for gene length bias in Kallisto results, but (correct me if I'm wrong) I don't think that's needed here since the counts are already collapsed to gene level. So the only thing I need to do is use matrix <- round(matrix), create a DESeq dataset from this, and use the function counts(deseq.dataset, normalized = TRUE).

Q1: Is it OK to input this data as-is for normalization, without any prior transformation for gene length or similar factors?

Q2: Does my scenario for edgeR work, and is it appropriate for correlation analysis?

Q3: Does my scenario for DESeq2 work, and is it appropriate for correlation analysis?

Thanks in advance for your help

DESeq2 tximport Kallisto TMM edgeR • 2.3k views

updated 3.4 years ago by Michael Love 43k • written 3.4 years ago by BioNovice247 • 0

score 2 · Accepted Answer · 2021-10-13

"I don't think that's needed here since the counts are already collapsed to gene level."

Even so, the argument of the tximport paper is that differential transcript usage (DTU), which can lead to changes in the gene length, can be corrected using our methods. Counts that are already collapsed to gene level would still carry this bias for DE at gene level.
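To make the DTU point concrete, here is a toy sketch in base R. All numbers are invented for illustration, and it assumes expected read counts are simply proportional to transcript length (ignoring effective-length corrections). It shows how a gene with the same number of transcripts in two conditions can still appear to change at the count level when isoform usage shifts.

```r
# Toy numbers, invented for illustration only.
tx_length <- c(short = 1000, long = 3000)   # transcript lengths in bp

# Fraction of the gene's transcripts coming from each isoform.
usage_A <- c(short = 0.9, long = 0.1)       # condition A
usage_B <- c(short = 0.1, long = 0.9)       # condition B

# Abundance-weighted average transcript length of the gene per condition.
gene_len_A <- sum(usage_A * tx_length)
gene_len_B <- sum(usage_B * tx_length)

# Same number of transcripts in both conditions (no true gene-level change),
# yet expected read counts scale with the average length.
apparent_fold_change <- gene_len_B / gene_len_A
print(c(A = gene_len_A, B = gene_len_B, fold = apparent_fold_change))
```

In this made-up case the gene would look more than two-fold "up" in condition B purely because of the length change, which is the bias the tximport offset is meant to absorb.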

But if you want to ignore this DTU effect on gene length, you can just round the counts and provide them to DESeqDataSetFromMatrix(). Again, this is only if you don't have access to the underlying transcript-level data. Otherwise I'd recommend using tximport.

Re: correlation analysis, I'd recommend vst() in the DESeq2 world. This is, for example, what WGCNA recommends when using expression values coming out of DESeq2. vst() will take into account sequencing depth (and gene length if you use the tximport pipeline).
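A minimal sketch of the round-then-vst() route described above, assuming only a gene-level matrix of estimated counts is available. The matrix `est_counts`, the sample table `coldata`, and the `mirna` vector are hypothetical stand-ins simulated here so the example runs on its own; replace them with the real data. The intercept-only design `~ 1` is an assumption that fits a pure transformation step, not a DE analysis.

```r
library(DESeq2)

set.seed(1)
n_genes <- 2000
n_samples <- 6

# Simulated stand-in for the published gene-level Kallisto matrix:
# non-integer estimated counts, genes in rows, samples in columns.
gene_mean <- 2^runif(n_genes, min = 5, max = 12)
est_counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(gene_mean, times = n_samples), size = 10) +
    runif(n_genes * n_samples),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# Hypothetical sample table; rows must match the columns of the count matrix.
coldata <- data.frame(sample = factor(colnames(est_counts)),
                      row.names = colnames(est_counts))

# Round to integers, ignoring the DTU effect on gene length.
dds <- DESeqDataSetFromMatrix(countData = round(est_counts),
                              colData = coldata,
                              design = ~ 1)

# Variance-stabilized values on a log2-like scale, adjusted for sequencing depth.
vsd <- vst(dds, blind = TRUE)
expr <- assay(vsd)

# Hypothetical miRNA profile across the same samples, for illustration only.
mirna <- rnorm(n_samples)
rho <- cor(expr["gene1", ], mirna, method = "spearman")
```

Note that vst() fits its dispersion trend on a subset of genes (1000 by default), so a very small matrix would need varianceStabilizingTransformation() instead.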