Dear Bioconductor community,
while trying to utilize the MultiAssayExperiment data container, along with a specific queried TCGA dataset, to perform some initial filtering and data cleaning prior downstream analysis, I have encountered the following error when tried to add an additional omics layer to the MAE object:
library(MultiAssayExperiment)
library(curatedTCGAData)
library(TCGAutils)
library(UpSetR)
library(DESeq2)
curatedTCGAData(diseaseCode = "LUAD", assays = "*", dry.run = TRUE)
luad.mae<- curatedTCGAData(
diseaseCode = "LUAD",
assays = c(
"RPPAArray","Mutation",
"RNASeq2GeneNorm", # here perhaps the raw RSEM counts
"GISTIC_ThresholdedByGene"
),
dry.run = FALSE
)
rag <- "LUAD_Mutation-20160128"
genome(luad.mae[[rag]]) <- translateBuild(genome(luad.mae[[rag]]))
seqlevelsStyle(luad.mae[[rag]]) <- "UCSC"
genome(luad.mae[[rag]])
luad.updated <- qreduceTCGA(luad.mae, keep.assay = FALSE)
luad.processed <- mergeReplicates(intersectColumns(luad.updated))
tums <- TCGAsampleSelect(colnames(luad.processed), "01")
luad.final <- luad.processed[, tums, ]
luad.final
A MultiAssayExperiment object of 4 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 4:
[1] LUAD_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 181 columns
[2] LUAD_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 146 columns
[3] LUAD_RPPAArray-20160128: SummarizedExperiment with 223 rows and 181 columns
[4] LUAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 22929 rows and 181 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
# Isolate the rna-seq omics assay to perform normalization
rna.seq <- getWithColData(luad.final, 2L)
count.dat <- assay(rna.seq)
pheno.dat <- colData(rna.seq)
dds <- DESeqDataSetFromMatrix(countData = count.dat,
colData = pheno.dat,
design = ~ 1)
dds.norm <- vst(dds)
# a small non-specific "intensity" filtering procedure to remove unexpressed features
NotExpressed <- apply( assay(dds.norm), MARGIN=2, function( z ){
dens <- density( z )
expr.cut <- dens$x[ which.max( dens$y )]
return( z < expr.cut )
} )
expr.ps <- rowSums( !NotExpressed ) > ( ncol( NotExpressed )/2 )
dds.filtered <- dds.norm[expr.ps, ]
dim(dds.filtered)
[1] 17374 146
sum(duplicated(rownames(assay(dds.filtered))))
[1] 0
luad2 <- c(luad.final, list(newMatrix = assay(dds.filtered)), mapFrom = "RNASeq2GeneNorm")
Error: subscript contains invalid rownames
In addition: Warning message:
Assuming column order in the data provided
matches the order in 'mapFrom' experiment(s) colnames
# alternative approach to directly replace the expression matrix instead of adding an additional layer:
luad.final[["LUAD_RNASeq2GeneNorm-20160128"]] <- assay(dds.filtered)
harmonizing input:
removing 146 sampleMap rows with 'colname' not in colnames of experiments
# but then unfortunately all the relative columns from the RNA-Seq experiment are lost
luad.final
A MultiAssayExperiment object of 4 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 4:
[1] LUAD_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 181 columns
[2] LUAD_RNASeq2GeneNorm-20160128: matrix with 17374 rows and 0 columns
[3] LUAD_RPPAArray-20160128: SummarizedExperiment with 223 rows and 181 columns
[4] LUAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 22929 rows and 181 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] M3C_1.10.0 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[3] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3
[5] DESeq2_1.28.1 limma_3.44.3
[7] UpSetR_1.4.0 TCGAutils_1.8.1
[9] cBioPortalData_2.0.10 AnVIL_1.0.3
[11] dplyr_1.0.2 curatedTCGAData_1.10.1
[13] MultiAssayExperiment_1.14.0 SummarizedExperiment_1.18.2
[15] DelayedArray_0.14.1 matrixStats_0.57.0
[17] Biobase_2.48.0 GenomicRanges_1.40.0
[19] GenomeInfoDb_1.24.2 IRanges_2.22.2
[21] S4Vectors_0.26.1 BiocGenerics_0.34.0
[23] TCGAbiolinks_2.17.3 tximport_1.16.1
[25] tximportData_1.16.0
Thus:
1) Why there is an error hitting with the c function ? I have also checked and there are not duplicated row names, only a smaller subset of the initial RNA-Seq features-
2) Is there an alternative way of replacing the original RNA-Seq dataset, with the updated transformed one ? without losing any other information like sample phenotype information ?
3) Finally, could a similar replacement be performed for additional assays, like the mutations and the CNAs simultaneously ? providing only additional updated matrices with a reduced number of features ?
Overall, my ultimate goal is to use the final MultiAssayExperiment for multi-omics integration, and one putative alternative is to convert each layer to a long data frame regarding the omics matrices, and keep an additional data frame with all phenotype information...
Best,
Efstathios
Dear Marcel,
thank you very much for your answer and comments !! and please excuse me for the long code chunk, I wanted just to be certain that I did not omitted any "extra" line that might caused the error !!
I was just confused based on my previous post regarding the MAE utilization (https://support.bioconductor.org/p/134693/) regarding your comment on the c function for adding experiments, and I did not utilized the whole name abbreviation (c(coad.updated, list(newMatrix = updated.matrix), mapFrom = "RNASeq2GeneNorm"))
I've made minor changes to the documentation and examples to make this more clear in
MultiAssayExperiment
1.17.2
.