Hi all,
Sorry for the simple question but I'm new to Kallisto and could not reach a definitive conclusion based on other posts.
I have two matrices of expression data (from a previous study) that I want to use for mRNA-miRNA correlation analysis. According to the authors, the mRNA expression matrix was derived using Kallisto. They did not mention tximport or any other tool, but the read counts are already collapsed to gene level: the matrix has gene names in the first column (e.g. TP53, A2M), genomic coordinates in the second column (e.g. "chr17_7661779_7687550_-"), and raw counts in the remaining columns (mostly non-integer values such as 1192.75812 and 2546.19874).
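To make the layout concrete, here is a minimal sketch of how I turn such a table into a numeric matrix in R. The column names, the second coordinate string, and the values in the second sample are placeholders I made up for illustration; only the TP53 coordinate string and the two example counts come from the real table.

```r
# Toy stand-in for the authors' table (hypothetical column names;
# the second coordinate and the sample2 values are made up)
expr <- data.frame(
  gene    = c("TP53", "A2M"),
  coord   = c("chr17_7661779_7687550_-", "chrN_start_end_strand"),
  sample1 = c(1192.75812, 2546.19874),
  sample2 = c(980.5, 2710.25),
  stringsAsFactors = FALSE
)

# Drop the two annotation columns and keep the counts as a numeric matrix
counts_mat <- as.matrix(expr[, -(1:2)])

# Use gene names as row names, as the normalization functions expect
rownames(counts_mat) <- expr$gene
```

This assumes gene names are unique; duplicated names would need to be resolved before they can be used as row names.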
My question concerns how to normalize this data for correlation analysis. From what I have read, the most suitable normalization methods for this purpose are edgeR's TMM and DESeq2's normalization. Based on other posts, I think the following scenarios would work, but I would appreciate it if someone could confirm this or correct my mistakes:
For edgeR: Apparently, edgeR can handle non-integer values, and I think no prior data transformation is needed since the counts are already collapsed to gene level. So simply running the following code should prepare the matrix for correlation analysis:
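library(edgeR)
dgelist <- DGEList(matrix) ## the matrix with gene names as row names and counts in columns
dgelist <- calcNormFactors(dgelist, method = "TMM") ## stores TMM scaling factors in the DGEList
norm_mat <- cpm(dgelist, log = TRUE) ## log2 counts per million using the TMM-adjusted library sizes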
For DESeq2: DESeq2 requires integers as input. Others suggest using the function DESeqDataSetFromTximport(), but I believe this works on the output of tximport, not on a plain matrix of counts. Apparently, this function also accounts for gene length bias in Kallisto results, but (correct me if I'm wrong) I don't think that's needed here since the counts are already collapsed to gene level. So the only thing I need to do is run matrix <- round(matrix), create a DESeq dataset from the rounded matrix, and call counts(deseq.dataset, normalized = TRUE).
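In code, the DESeq2 scenario I have in mind looks roughly like the sketch below. The toy matrix, the sample names, the `design = ~ 1` choice (no experimental design, since I only want normalization), and the final log2(x + 1) transform are my own assumptions for illustration, not something taken from the authors.

```r
library(DESeq2)

# Toy stand-in for the real matrix: 4 genes x 3 samples of
# non-integer estimated counts (made-up gene names and values)
mat <- matrix(
  c(1192.75812, 2546.19874, 310.5, 45.2,
    1010.3,     2800.9,     295.1, 60.7,
    1300.6,     2400.4,     330.8, 38.9),
  nrow = 4,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("s1", "s2", "s3"))
)

# DESeq2 expects integer counts, so round and store as integers
mat_int <- round(mat)
storage.mode(mat_int) <- "integer"

# Minimal sample table; rows must match the matrix columns
coldata <- data.frame(sample = colnames(mat_int),
                      row.names = colnames(mat_int))

# Intercept-only design: normalization only, no differential testing
dds <- DESeqDataSetFromMatrix(countData = mat_int,
                              colData   = coldata,
                              design    = ~ 1)

# Median-of-ratios size factors, then size-factor-normalized counts
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)

# Simple log transform before correlation (an assumption on my side)
log_norm <- log2(norm_counts + 1)
```

I am not sure whether a plain log2(x + 1) is the right last step here or whether something like vst() would be preferable for correlation, so please treat that line as a placeholder.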
Q1: Is it OK to input this data as-is for normalization, without any prior correction for gene length or similar biases?
Q2: Does my edgeR scenario work, and is it appropriate for correlation analysis?
Q3: Does my DESeq2 scenario work, and is it appropriate for correlation analysis?
Thanks in advance for your help