Perform correlation analysis between miRNA and mRNA gene expression data on the same TCGA dataset based on the curatedTCGAData
svlachavas ▴ 830
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Community,

Briefly, starting from a gene signature previously identified in a specific cancer type (through gene expression analysis), I also found 4 specific microRNAs (mature miRs) that, according to experimentally validated databases, regulate a subset of my signature (~18 genes). As a final step, I would like to explore, in the TCGA COAD dataset, the expression of these miRs and of these genes in the same patients, to look for any significant negative correlation, which would further support my hypothesis.

From a quick search, I found that the curatedTCGAData R package contains various assays for various TCGA cancer types, including the cancer of interest. A small query returns:

curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)

Please see the list below for available cohorts and assays
Available Cancer codes:
 ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
 KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
 PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM 
Available Data Types:
 CNACGH CNASeq CNASNP CNVSNP GISTICA GISTICT
 Methylation miRNAArray miRNASeqGene mRNAArray
 Mutation RNASeq2GeneNorm RNASeqGene RPPAArray 

Thus:

A) For COAD, which data types should I select in order to get only the miRNA expression and the RNA-seq expression data?

I see that there are miRNAArray, miRNASeqGene, RNASeq2GeneNorm, RNASeqGene and mRNAArray. However, I don't know the specific differences, as I have mostly used data from the GDC server. My understanding is that both types of expression should be normalized and/or transformed in the same way for the correlation analysis to be appropriate.

B) Moreover, how could I subset both assays simultaneously, based on specific miRs and specific gene symbols?

Any suggestions, help or ideas would be greatly appreciated!

curatedTCGAData MultiAssayExperiment multiomics TCGA • 2.1k views
Levi Waldron ★ 1.1k
@levi-waldron-3429
Last seen 1 day ago
CUNY Graduate School of Public Health a…

A) The drill-down process in curatedTCGAData goes something like this. The data are the last snapshot provided by TCGA Firehose, i.e. GDC "legacy" data (https://confluence.broadinstitute.org/display/GDAC/FAQ).

> library(curatedTCGAData)
> curatedTCGAData(diseaseCode = "*", assays = "*", dry.run = TRUE)
Please see the list below for available cohorts and assays
Available Cancer codes:
 ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
 KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
 PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM 
Available Data Types:
 CNACGH CNASeq CNASNP CNVSNP GISTICA GISTICT
 Methylation miRNAArray miRNASeqGene mRNAArray
 Mutation RNASeq2GeneNorm RNASeqGene RPPAArray 
> curatedTCGAData(diseaseCode = "COAD", assays = "*", dry.run = TRUE)
                                 COAD_CNASeq                                  COAD_CNASNP 
                  "COAD_CNASeq-20160128.rda"                   "COAD_CNASNP-20160128.rda" 
                                 COAD_CNVSNP                        COAD_GISTIC_AllByGene 
                  "COAD_CNVSNP-20160128.rda"         "COAD_GISTIC_AllByGene-20160128.rda" 
               COAD_GISTIC_ThresholdedByGene                            COAD_Methylation1 
"COAD_GISTIC_ThresholdedByGene-20160128.rda"     "COAD_Methylation_methyl27-20160128.rda" 
                           COAD_Methylation2                            COAD_miRNASeqGene 
   "COAD_Methylation_methyl450-20160128.rda"             "COAD_miRNASeqGene-20160128.rda" 
                              COAD_mRNAArray                                COAD_Mutation 
               "COAD_mRNAArray-20160128.rda"                 "COAD_Mutation-20160128.rda" 
                        COAD_RNASeq2GeneNorm                              COAD_RNASeqGene 
         "COAD_RNASeq2GeneNorm-20160128.rda"               "COAD_RNASeqGene-20160128.rda" 
                              COAD_RPPAArray 
               "COAD_RPPAArray-20160128.rda" 
> curatedTCGAData(diseaseCode = "COAD", assays = c("miRNASeqGene", "RNASeq2GeneNorm"), dry.run = TRUE)
                  COAD_miRNASeqGene                COAD_RNASeq2GeneNorm 
   "COAD_miRNASeqGene-20160128.rda" "COAD_RNASeq2GeneNorm-20160128.rda" 
> mae <- curatedTCGAData(diseaseCode = "COAD", assays = c("miRNASeqGene", "RNASeq2GeneNorm"), dry.run = FALSE)
>

B) This provides a MultiAssayExperiment object, which you can subset by rownames to select genes and miRNA of interest. The MultiAssayExperiment package has a cheat sheet to help with quick reference for such operations. For example:

> mae
A MultiAssayExperiment object of 2 listed
 experiments with user-defined names and respective classes. 
 Containing an ExperimentList class object of length 2: 
 [1] COAD_miRNASeqGene-20160128: SummarizedExperiment with 705 rows and 221 columns 
 [2] COAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 191 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices
> rownames(mae)
CharacterList of length 2
[["COAD_miRNASeqGene-20160128"]] hsa-let-7a-1 hsa-let-7a-2 hsa-let-7a-3 hsa-let-7b ... hsa-mir-98 hsa-mir-99a hsa-mir-99b
[["COAD_RNASeq2GeneNorm-20160128"]] A1BG A1CF A2BP1 A2LD1 A2ML1 A2M A4GALT ... ZYG11A ZYG11B ZYX ZZEF1 ZZZ3 psiTPTE22 tAKR
> rownames(mae[c("hsa-let-7a-1", "A1BG"), , ])
CharacterList of length 2
[["COAD_miRNASeqGene-20160128"]] hsa-let-7a-1
[["COAD_RNASeq2GeneNorm-20160128"]] A1BG
>

Note that the TCGAutils package provides a number of other helper functions for MultiAssayExperiment objects coming from curatedTCGAData, for example, adding ranges so that you can subset by GRanges objects instead of by symbols.
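
The two assays shown above have different numbers of columns (221 vs 191), so a per-patient correlation first needs the samples to be matched. MultiAssayExperiment has its own helpers for this; the base R sketch below only illustrates the underlying logic on toy matrices. The barcodes, feature values and helper names (is_primary, patient_id) are invented for illustration, and the sketch assumes at most one primary tumor sample per patient.

```r
# Toy stand-ins for the two assay matrices (features in rows, samples in columns).
# All barcodes and values here are invented for illustration.
set.seed(1)
mir_barcodes <- c("TCGA-AA-0001-01A-01T", "TCGA-AA-0002-01A-01T",
                  "TCGA-AA-0003-01A-01T", "TCGA-AA-0004-11A-01T")
rna_barcodes <- c("TCGA-AA-0003-01A-01R", "TCGA-AA-0001-01A-01R",
                  "TCGA-AA-0005-01A-01R")
mir <- matrix(rpois(8, 50), nrow = 2,
              dimnames = list(c("hsa-let-7a-1", "hsa-mir-21"), mir_barcodes))
rna <- matrix(rpois(6, 500), nrow = 2,
              dimnames = list(c("A1BG", "ZYX"), rna_barcodes))

# Characters 14-15 of a TCGA barcode encode the sample type
# ("01" = primary solid tumor, "11" = solid tissue normal).
is_primary <- function(bc) substr(bc, 14, 15) == "01"
mir <- mir[, is_primary(colnames(mir)), drop = FALSE]
rna <- rna[, is_primary(colnames(rna)), drop = FALSE]

# The first 12 characters of a barcode identify the patient.
patient_id <- function(bc) substr(bc, 1, 12)
colnames(mir) <- patient_id(colnames(mir))
colnames(rna) <- patient_id(colnames(rna))

# Keep only patients present in both assays, in the same column order.
shared <- intersect(colnames(mir), colnames(rna))
mir_matched <- mir[, shared, drop = FALSE]
rna_matched <- rna[, shared, drop = FALSE]
```

After this step, column i of mir_matched and column i of rna_matched refer to the same patient, which is what a paired correlation requires.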

Dear Levi,

Thank you very much for your detailed answer and suggested steps. Some crucial comments:

1) The link you included above does not work. Is it possible to look up which processing steps and algorithms were used to produce the "miRNASeqGene" and "RNASeq2GeneNorm" assays?

I could only find the following link:

http://gdac.broadinstitute.org/runs/stddata_latest/samplesreport/

For which it mentions:

*mRNAseq Preprocessor

The mRNAseq preprocessor picks the "scaled_estimate" (RSEM) value from Illumina HiSeq/GA2 mRNAseq level_3 (v2) data set and makes the mRNAseq matrix with log2 transformed for the downstream analysis. If there are overlap samples between two different platforms, samples from illumina hiseq will be selected. The pipeline also creates the matrix with RPKM and log2 transform from HiSeq/GA2 mRNAseq level 3 (v1) data set.*

*miRseq Preprocessor

The miRseq preprocessor picks the "RPM" (reads per million miRNA precursor reads) from the Illumina HiSeq/GA miRseq Level_3 data set and makes the matrix with log2 transformed values.*

Thus, are the miRNASeqGene values log2 RPM? And what about RNASeq2GeneNorm: are these RSEM or RPKM values? Please excuse my insistence, but this is crucial for deciding whether the two sets of expression values are comparable, so that I can perform a direct correlation analysis, or whether further transformations are necessary.

2) A) Thank you also for the cheat sheet. So the one call I could use to subset both datasets simultaneously would be:

rownames(mae[c("hsa-let-7a-1", "A1BG"), , ]) ? That is, with both gene symbols and miRs included in the same vector?

B) Moreover, when using assays(), do you think it is also necessary to subset to only the samples common to both experiments? I assume this is mandatory for my type of correlation analysis.

1) I've fixed the link to the Firehose FAQ above. Your interpretation of the miRNASeqGene values seems correct to me, but I don't want to represent myself as an expert on the Firehose pipeline itself or take any chance of giving you a wrong answer there...

2) You may find wideFormat(), MatchedAssayExperiment(), and assays() from MultiAssayExperiment useful for different kinds of correlation analysis. There are some examples in the workshop I gave last year at BioC2018.
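
As a rough illustration of the correlation step itself, here is a minimal base R sketch on simulated data; it assumes two matrices that have already been matched by patient, as you would obtain after extracting them from the MultiAssayExperiment. The feature names (mir_a, gene_a, ...) and all values are hypothetical, with one negative relationship planted on purpose. Spearman correlation is rank-based, so it does not change under monotone transformations such as log2; the log2(x + 1) step mainly matters if you prefer Pearson correlation.

```r
# Simulated, already-matched expression matrices (features x patients).
# Names and values are hypothetical; mir_a/gene_a carry a planted negative relation.
set.seed(42)
n <- 40
samples <- sprintf("patient_%02d", seq_len(n))
mir_a <- runif(n, 10, 100)
mir <- rbind(mir_a = mir_a, mir_b = runif(n, 10, 100))
rna <- rbind(gene_a = 5000 / mir_a * exp(rnorm(n, sd = 0.1)),
             gene_b = runif(n, 100, 1000))
colnames(mir) <- colnames(rna) <- samples

# Put both assays on a log2 scale (pseudocount of 1 avoids log2(0)).
log_mir <- log2(mir + 1)
log_rna <- log2(rna + 1)

# Test every miRNA-gene pair with Spearman correlation.
pairs <- expand.grid(mirna = rownames(log_mir), gene = rownames(log_rna),
                     stringsAsFactors = FALSE)
tests <- lapply(seq_len(nrow(pairs)), function(i) {
  cor.test(log_mir[pairs$mirna[i], ], log_rna[pairs$gene[i], ],
           method = "spearman", exact = FALSE)
})
pairs$rho     <- vapply(tests, function(x) unname(x$estimate), numeric(1))
pairs$p_value <- vapply(tests, function(x) x$p.value, numeric(1))

# Correct for the number of pairs tested (Benjamini-Hochberg).
pairs$fdr <- p.adjust(pairs$p_value, method = "BH")

# Most negative correlations first.
pairs[order(pairs$rho), ]
```

With 4 miRs and ~18 genes this amounts to roughly 72 tests, so some multiple-testing correction is worth keeping; whether BH is the right choice for a given study is a judgment call.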

Thank you once more, Levi, for the feedback and information. I fully understand that the choice of normalization is up to me. The workshop link looks great, so I think that, with your functions and tutorials, subsetting and moving on to downstream analysis will not be a bottleneck. I will create a new post or reply here with any specific questions about MultiAssayExperiment functions.
