Load libraries

Question

Importing RSEM processed data already formatted as a summarized experiment into DESeq2

0

Entering edit mode

Ndimensional ▴ 20

@vlaufer-14169

Last seen 21 months ago

United States

Hello,

I have a fairly simple question that I know has been addressed many times: I want to import RSEM data into DESeq2 for modeling and DE. For reproducibility, the workflow is:

Unfortunately, it is costing me an inordinate amount of time, and I cannot find a perfectly analogous example either in any vignette (for DESeq2, tximport, tximeta, tximportData, etc) or in other posts.

Load libraries

libsNeeded<-c( "cBioPortalData", "DESeq2", "Tximport", "SummarizedExperiment")
BiocManager::install(libsNeeded, update = TRUE, ask = FALSE)
lapply(libsNeeded, library, character.only = TRUE)

Get a MultiAssayExperiment using cBioPortal

cbio <- cBioPortal()
getStudies(api = cbio)
studies <- getStudies(cbio, buildReport = TRUE)
CptacMae<-cBioDataPack("brain_cptac_2020", ask=FALSE)
CptacRnaSeqSE<-experiments(BrainCptac2020Mae)$mrna_seq_v2_rsem

Look at resulting MAE object

CptacMae # Returns, a MultiAssayExperiment with 6 listed experiments
CptacRnaSeqSE # returns a Summarized Experiment, and:
assays(CptacMae)$mrna_seq_v2_rsem # returns a DataFrame of 18,000 rows and 188 cols. 
row.names(CptacRnaSeqSE) # returns HUGO gene symbols - so this is **not** transcript level info
all(row.names(colData(CptacRnaSeqSE) ) == colnames(assays(CptacRnaSeqSE)[[1]])) # TRUE

Determine prior processing of Rna-Seq data by reading the metadata from cBioPortal:

cancer_study_identifier: brain_cptac_2020

stable_id: rna_seq_v2_mrna

profile_name: RNA expression

profile_description: Log2 of FPKM + 1 expression values. STAR v2.6.1d aligned + RSEM v1.3.1 quantification.

genetic_alteration_type: MRNA_EXPRESSION

datatype: CONTINUOUS

show_profile_in_analysis_tab: false

data_filename: data_mrna_seq_v2_rsem.txt

Of course, I cannot load these data directly with DESeqDataSetFromMatrix() because the values are decimal and it wouldn't be appropriate.

Using RSEM with tximport and DESeq2 provides a workflow using:

library(tximeta)
se <- tximeta(ColData, type="rsem", txIn=FALSE, txOut=FALSE, skipMeta=TRUE)
assays(se)$length[ assays(se)$length == 0] <- NA # set these as missing

but trying this on my data returns:

tximeta(CptacRnaSeqSE, type = "rsem", txIn = FALSE, txOut = FALSE, skipMeta = TRUE)

all(c("files", "names") %in% names(coldata)) is not TRUE

I do not have files, I am working from a SE directly. This would seem like it should be the easiest scenario on paper - it's already in the native language in an appropriate data object - but its proving to be more difficult than just downloading flat files would have been ...

What am I missing here? Thank you!

tximport tximportData DESeq2 • 2.0k views

ADD COMMENT • link 22 months ago • updated 21 months ago Ndimensional ▴ 20

0

Entering edit mode

Update: changes made by dev team for cBioPortalData package just a few days ago detailed below. This will likely resolve the issues mentioned by those who have responded to the post:

I've incorporated information from SAMPLE_ID from the datasets to map and build SummarizedExperiment objects. Now, you should get an object that looks like the following:

mae <- cBioDataPack("brain_cptac_2020")
A MultiAssayExperiment object of 7 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 7:
 [1] cna: SummarizedExperiment with 19380 rows and 190 columns
 [2] linear_cna: SummarizedExperiment with 19380 rows and 190 columns
 [3] mrna_seq_v2_rsem_zscores_ref_all_samples: SummarizedExperiment with 18209 rows and 188 columns
 [4] mrna_seq_v2_rsem: SummarizedExperiment with 18209 rows and 188 columns
 [5] mutations: RaggedExperiment with 9951 rows and 200 columns
 [6] protein_quantification_zscores: SummarizedExperiment with 6429 rows and 218 columns
 [7] protein_quantification: SummarizedExperiment with 6429 rows and 218 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

These changes are in the latest version of cBioPortalData in Bioc-devel (package version 2.13.4).

ADD REPLY • link 21 months ago Ndimensional ▴ 20

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

A DESeqDataSet is a SummarizedExperiment with a few extra slots. You should be able to do something like

se <- as(se, "DESeqDataSet")
design(se) <- <put correct design here>

Maybe you need to do other things, but your code is non-functional and I can't find the data you are using, so cannot tell you any more.

ADD COMMENT • link 22 months ago James W. MacDonald 68k

0

Entering edit mode

the code runs as is on my machine and on a fresh instance of R v4.3 on a remote machine - are you using an older version of R?

ADD REPLY • link 22 months ago Ndimensional ▴ 20

0

Entering edit mode

Yes, of course I need to do additional things, as the post itself clearly implies. Thank you for your help.

ADD REPLY • link 22 months ago Ndimensional ▴ 20

0

Entering edit mode

https://www.cbioportal.org/study/summary?id=brain_cptac_2020

https://google.gprivate.com/search.php?search?q=cptac+2020+brain

and the other 1.7M hits that are returned from a cut and paste query

ADD REPLY • link 22 months ago Ndimensional ▴ 20

score 2 · Accepted Answer · 2023-07-03

2

Entering edit mode

ATpoint ★ 4.8k

@atpoint-13662

Last seen 2 days ago

Germany

profile_description: Log2 of FPKM + 1 expression values. STAR v2.6.1d aligned + RSEM v1.3.1 quantification.

If the data is indeed log2(fpkm) then it is not suited for DESeq2 anyway. Note that the code does not run as such, there is no package called cBioPortal, tximport must be lowercase "t" and CptacMae<-cBioDataPack("brain_cptac_2020", ask=FALSE) throws an error for me that this studyId does not exist. Anyway, tximport requires the rsem output, you cannot run this with the data you have, but if it was raw gene level estimates you could round to nearest integer, this was suggested here on the support site on multiple occasions.

ADD COMMENT • link 22 months ago ATpoint ★ 4.8k

0

Entering edit mode

thanks I emended biocportal to biocportal data.

ADD REPLY • link 22 months ago Ndimensional ▴ 20

0

Entering edit mode

that data object does indeed exist both on the biocPortal website and in the studies object i mention above, row 30.

ADD REPLY • link 22 months ago Ndimensional ▴ 20

0

Entering edit mode

thx - if a mod moves this to an answer Ill accept it. dont know how i missed this probably just too used to looking at non log transformed RSEM data to have noticed

VAL

ADD REPLY • link 22 months ago Ndimensional ▴ 20