Hello,
I have a fairly simple question that I know has been addressed many times: I want to import RSEM data into DESeq2 for modeling and DE. For reproducibility, the workflow is:
Unfortunately, it is costing me an inordinate amount of time, and I cannot find a perfectly analogous example either in any vignette (for DESeq2, tximport, tximeta, tximportData, etc) or in other posts.
Load libraries
libsNeeded<-c( "cBioPortalData", "DESeq2", "Tximport", "SummarizedExperiment")
BiocManager::install(libsNeeded, update = TRUE, ask = FALSE)
lapply(libsNeeded, library, character.only = TRUE)
Get a MultiAssayExperiment using cBioPortal
cbio <- cBioPortal()
getStudies(api = cbio)
studies <- getStudies(cbio, buildReport = TRUE)
CptacMae<-cBioDataPack("brain_cptac_2020", ask=FALSE)
CptacRnaSeqSE<-experiments(BrainCptac2020Mae)$mrna_seq_v2_rsem
Look at resulting MAE object
CptacMae # Returns, a MultiAssayExperiment with 6 listed experiments
CptacRnaSeqSE # returns a Summarized Experiment, and:
assays(CptacMae)$mrna_seq_v2_rsem # returns a DataFrame of 18,000 rows and 188 cols.
row.names(CptacRnaSeqSE) # returns HUGO gene symbols - so this is **not** transcript level info
all(row.names(colData(CptacRnaSeqSE) ) == colnames(assays(CptacRnaSeqSE)[[1]])) # TRUE
Determine prior processing of Rna-Seq data by reading the metadata from cBioPortal:
cancer_study_identifier: brain_cptac_2020
stable_id: rna_seq_v2_mrna
profile_name: RNA expression
profile_description: Log2 of FPKM + 1 expression values. STAR v2.6.1d aligned + RSEM v1.3.1 quantification.
genetic_alteration_type: MRNA_EXPRESSION
datatype: CONTINUOUS
show_profile_in_analysis_tab: false
data_filename: data_mrna_seq_v2_rsem.txt
Of course, I cannot load these data directly with DESeqDataSetFromMatrix()
because the values are decimal and it wouldn't be appropriate.
Using RSEM with tximport and DESeq2 provides a workflow using:
library(tximeta)
se <- tximeta(ColData, type="rsem", txIn=FALSE, txOut=FALSE, skipMeta=TRUE)
assays(se)$length[ assays(se)$length == 0] <- NA # set these as missing
but trying this on my data returns:
tximeta(CptacRnaSeqSE, type = "rsem", txIn = FALSE, txOut = FALSE, skipMeta = TRUE)
all(c("files", "names") %in% names(coldata)) is not TRUE
I do not have files, I am working from a SE directly. This would seem like it should be the easiest scenario on paper - it's already in the native language in an appropriate data object - but its proving to be more difficult than just downloading flat files would have been ...
What am I missing here? Thank you!
Update: changes made by dev team for cBioPortalData package just a few days ago detailed below. This will likely resolve the issues mentioned by those who have responded to the post:
I've incorporated information from SAMPLE_ID from the datasets to map and build SummarizedExperiment objects. Now, you should get an object that looks like the following:
These changes are in the latest version of cBioPortalData in Bioc-devel (package version 2.13.4).