Dear Community,
Briefly: starting from a gene signature previously identified in a specific type of cancer (through gene expression analysis), I also found 4 specific microRNAs (mature miRs) that, according to experimentally validated databases, regulate a subset of my signature (~18 genes). As a final step, I would like to explore in the TCGA COAD dataset the expression of these miRs and of their target genes in the same patients, looking for any significant negative correlation, which would further support my hypothesis.
From a quick search, I found that the curatedTCGAData R package provides various assays for the TCGA cohorts, including the cancer of interest. A small query:
curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
Please see the list below for available cohorts and assays
Available Cancer codes:
ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM
Available Data Types:
CNACGH CNASeq CNASNP CNVSNP GISTICA GISTICT
Methylation miRNAArray miRNASeqGene mRNAArray
Mutation RNASeq2GeneNorm RNASeqGene RPPAArray
Thus:
A) For COAD, which data types should I select in order to get only the miRNA expression and the RNA-Seq expression data?
I see that there are miRNAArray, miRNASeqGene, RNASeq2GeneNorm, RNASeqGene and mRNAArray. However, I don't know the specific differences, as I have mostly used data from the GDC server. My understanding is that both types of expression data should be normalized and/or transformed in the same way for the correlation analysis to be appropriate.
B) Moreover, how could I subset both assays by specific miRs and specific gene symbols simultaneously?
Any suggestions, help or ideas would be much appreciated!
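For question A, a minimal sketch of requesting only the two sequencing-based assays for COAD might look like the following. It assumes a curatedTCGAData release that accepts a `version` argument (older releases, such as the one used for the query above, do not have it), and the version string "1.1.38" is an assumption chosen for illustration, not a recommendation.

```r
library(curatedTCGAData)
library(MultiAssayExperiment)

# Request only miRNA-seq and normalized RNA-seq data for COAD.
# dry.run = FALSE downloads the data (needs network access and disk cache).
# NOTE: `version` is an assumption; drop it on releases that lack the argument.
coad <- curatedTCGAData(
    diseaseCode = "COAD",
    assays = c("miRNASeqGene", "RNASeq2GeneNorm"),
    version = "1.1.38",
    dry.run = FALSE
)

# Inspect what came back: experiment names, dimensions, and sample map.
coad
names(experiments(coad))
```

The array-based types (miRNAArray, mRNAArray) are left out here only because the question asks for sequencing data; whether the two returned assays are on comparable scales still has to be checked against the Firehose processing notes.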
Dear Levi,
thank you very much for your detailed answer, suggestions and proposed steps. Some crucial comments:
1) The link you included above does not work. Is it possible to check which processing steps, and which algorithms, were used to produce the "miRNASeqGene" and "RNASeq2GeneNorm" assays?
I could only find the following link:
http://gdac.broadinstitute.org/runs/stddata_latest/samplesreport/
which states:
*mRNAseq Preprocessor
The mRNAseq preprocessor picks the "scaled_estimate" (RSEM) value from Illumina HiSeq/GA2 mRNAseq level_3 (v2) data set and makes the mRNAseq matrix with log2 transformed for the downstream analysis. If there are overlap samples between two different platforms, samples from illumina hiseq will be selected. The pipeline also creates the matrix with RPKM and log2 transform from HiSeq/GA2 mRNAseq level 3 (v1) data set.*
*miRseq Preprocessor
The miRseq preprocessor picks the "RPM" (reads per million miRNA precursor reads) from the Illumina HiSeq/GA miRseq Level_3 data set and makes the matrix with log2 transformed values.*
So the miRNASeqGene values are log2 RPM values? And what about RNASeq2GeneNorm: are these RSEM or RPKM values? Please excuse me for insisting on this, but it is crucial for deciding whether the two sets of expression values are comparable and I can perform a direct correlation analysis, or whether further transformations are necessary.
2) A) Thank you also for the cheat sheet. So, the one function that I could use to subset both datasets simultaneously would be:
B) Moreover, with the function assays(), do you think it is also necessary to subset to only the samples common to both experiments? Is that mandatory for my type of correlation analysis?
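To make points 2A and 2B concrete, here is a small sketch on a toy MultiAssayExperiment built locally, so it runs without downloading anything. All feature and patient names (mirA, GENE2, P1, ...) are made up; with the real COAD object, the row names would be whatever identifiers the two assays actually use. It shows row subsetting with a single character vector across both assays, followed by intersectColumns() to keep only patients present in both.

```r
library(MultiAssayExperiment)

# Toy stand-ins for the two assays; every name here is hypothetical.
set.seed(1)
mir <- matrix(rpois(12, 50), nrow = 3,
              dimnames = list(c("mirA", "mirB", "mirC"),
                              paste0("P", 1:4, "_mir")))
rna <- matrix(rpois(12, 200), nrow = 4,
              dimnames = list(paste0("GENE", 1:4),
                              paste0("P", 2:4, "_rna")))

# Sample map: which assay column belongs to which patient ("primary").
# Patient P1 has miRNA data only.
smap <- S4Vectors::DataFrame(
    assay   = factor(rep(c("miRNASeqGene", "RNASeq2GeneNorm"), times = c(4, 3))),
    primary = c(paste0("P", 1:4), paste0("P", 2:4)),
    colname = c(colnames(mir), colnames(rna))
)
pdat <- S4Vectors::DataFrame(patient = paste0("P", 1:4),
                             row.names = paste0("P", 1:4))

mae <- MultiAssayExperiment(
    experiments = list(miRNASeqGene = mir, RNASeq2GeneNorm = rna),
    colData = pdat,
    sampleMap = smap
)

# 2A: one character vector of miR and gene names subsets rows in every assay;
# names that do not occur in an assay are simply not matched there.
sub <- mae[c("mirA", "mirC", "GENE2"), , ]

# 2B: keep only patients that have data in all remaining assays (drops P1).
matched <- intersectColumns(sub)

rownames(matched)
colnames(matched)
```

Restricting to common patients is not enforced by the container itself, but a per-patient correlation only makes sense on samples measured in both assays, so some form of matching is needed before correlating.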
1) I've fixed the link to the Firehose FAQ above. Your interpretation of the miRNASeqGene values seems correct to me, but I don't want to represent myself as an expert on the Firehose pipeline itself or take any chance of giving you a wrong answer there...
2) You may find wideFormat(), MatchedAssayExperiment(), and assays() from MultiAssayExperiment useful for different kinds of correlation analysis. There are some examples in the workshop I gave last year at BioC2018.

Thank you once more, Levi, for the feedback and information. I fully understand that the choice of normalization is up to me, as is how I proceed. The workshop link looks great, so I think that, with your functions and tutorials, subsetting and moving on to downstream analysis will not be a bottleneck. I will create a new post or answer here for any specific questions about MultiAssayExperiment functions.
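For the correlation step itself, a base-R sketch on simulated data might look like this. The matrices, feature names, and the choice of Spearman correlation with a one-sided test are all assumptions made for illustration; they stand in for miR and gene matrices that have already been subset and matched by patient. Because Spearman correlation is rank-based, it does not change under a monotone transformation such as log2(x + 1) applied to either assay, which eases the comparability concern for that particular method (Pearson correlation would be affected).

```r
# Simulated, already patient-matched matrices (features x patients).
# All names and values are hypothetical.
set.seed(42)
n <- 30
patients <- paste0("P", seq_len(n))
mir_expr <- matrix(rexp(2 * n, rate = 1 / 100), nrow = 2,
                   dimnames = list(c("mirA", "mirB"), patients))
gene_expr <- matrix(rexp(3 * n, rate = 1 / 500), nrow = 3,
                    dimnames = list(c("GENE1", "GENE2", "GENE3"), patients))

# Columns must refer to the same patients in the same order.
stopifnot(identical(colnames(mir_expr), colnames(gene_expr)))

# Test every miR-gene pair for a negative monotone association.
pairs <- expand.grid(mir = rownames(mir_expr), gene = rownames(gene_expr),
                     stringsAsFactors = FALSE)
res <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(i) {
    ct <- cor.test(mir_expr[pairs$mir[i], ], gene_expr[pairs$gene[i], ],
                   method = "spearman", alternative = "less")
    data.frame(mir = pairs$mir[i], gene = pairs$gene[i],
               rho = unname(ct$estimate), p = ct$p.value)
}))

# Adjust for multiple testing across all pairs (Benjamini-Hochberg).
res$padj <- p.adjust(res$p, method = "BH")

# Spearman's rho is unchanged by log2(x + 1) on either matrix.
rho_log <- cor(log2(mir_expr["mirA", ] + 1), log2(gene_expr["GENE1", ] + 1),
               method = "spearman")
rho_raw <- cor(mir_expr["mirA", ], gene_expr["GENE1", ], method = "spearman")

res
```

With 4 miRs and about 18 genes the same loop gives 72 pairs, so the multiple-testing adjustment matters; the simulated data here are independent, so no significant pairs are expected.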